Comments/Complaints:
Sources & Acknowledgments:
Description of the Course
This machine learning course is designed for those who have a solid basis in R and statistics but are complete beginners with machine learning. After a broad overview of the discipline’s most common techniques and applications, you’ll gain more insight into the assessment and training of different machine learning models. The rest of the course is dedicated to a first encounter with three of the most basic machine learning tasks: classification, regression and clustering.
In this first chapter, you get your first intro to machine learning. After learning the true fundamentals of machine learning, you’ll experiment with the techniques that are explained in more detail in future chapters.
At a basic level, Machine Learning (ML) explores the construction and use of algorithms that can learn from data. A “machine” is ready to learn if it has the ability to improve its performance as it receives more information. This information comes in the form of observations: particular instances of problems solved before.
The presence of input knowledge (or simply … data) is a central concept in ML. Typically, this is the available dataset: a set of observations, each with a well-defined number of variables, often called features. An example is given in the table below:
The typical example for importing this dataset to R as a data frame is shown here:
squares <- data.frame(
  size = c("small", "big", "medium"),
  edge = c("dotted", "striped", "normal"),
  color = c("green", "yellow", "green"))
In a data frame, observations correspond to the rows and variables to the columns. The R functions dim(), str(), and summary() provide the dimensions, structure, and summary statistics of our dataset.
dim(squares) # Observations, #Features
## [1] 3 3
str(squares) # Structured Overview
## 'data.frame': 3 obs. of 3 variables:
## $ size : Factor w/ 3 levels "big","medium",..: 3 1 2
## $ edge : Factor w/ 3 levels "dotted","normal",..: 1 3 2
## $ color: Factor w/ 2 levels "green","yellow": 1 2 1
summary(squares) # Stats of your data
## size edge color
## big :1 dotted :1 green :2
## medium:1 normal :1 yellow:1
## small :1 striped:1
Let us formulate the problem of labeling a given unknown square based on the knowledge the machine gains from the squares dataset we just created. The “workflow” of our classification ML algorithm would then be as shown in the visualization below.
The ML algorithm forms a function that solves the problem based on the previous knowledge we provide about how similar problems were solved. Ideally, the function we build is generally applicable, handling all kinds of different objects.
In this example, determining the most frequently occurring color or calculating the average size of our sample of squares is not machine learning. The main purpose of ML is to construct predictive models based on the data that will provide answers for new instances of similar problems.
An example of an ML process is to use regression on the heights and weights of a set of people in order to build a function that can predict the height of a new person given their weight:
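A minimal sketch of such a model in R, using made-up height/weight values (all numbers are hypothetical, purely for illustration):

```r
# Hypothetical sample: weights (kg) and heights (cm) of a known set of people
weights <- c(60, 72, 57, 90, 95, 72)
heights <- c(165, 180, 160, 188, 192, 175)

# Fit a linear function to the known observations
lm_height <- lm(heights ~ weights)

# Predict the height of a new person, given only their weight (80 kg)
predict(lm_height, data.frame(weights = 80))
```

The fitted line captures the relation in the known observations; predict() then applies it to an unseen weight.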
As a first step, you want to find out some properties of the dataset with which you’ll be working. More specifically, you want to know more about the dataset’s number of observations and variables.
In this exercise, you’ll explore the iris dataset. If you want to learn more about it, you can type ?iris in the console.
Your job is to extract the number of observations and variables from iris. This dataset is readily available in R (in the datasets package that’s loaded by default). An alternative way is to insert the data from their original .csv file:
iris <- read.csv("./Data/iris.csv", header=TRUE, sep = ",")
Use str() and dim() and interpret the results:
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
dim(iris)
## [1] 150 5
Use head() and tail() on iris to reveal the first and last observations in the iris dataset:
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
tail(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
Use the summary() function to generate a summary of the dataset and assess the information in the printout:
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.054 Mean :3.759 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Part of excelling at machine learning is knowing when you’re dealing with a machine learning problem in the first place. Machine learning is more than simply computing averages or performing some data manipulation. It actually involves making predictions about observations based on previous information.
Which of the following statements uses a machine learning model?
The correct answer is ‘(1) & (3)’. Both spam detection and photo recognition are common examples of classification.
The difference between basic data manipulation and machine learning is not always clear. Let’s have a look at the statements below and identify the one which is not a machine learning problem.
Given a viewer’s shopping habits, recommend a product to purchase the next time she visits your website.
This is a ML problem.
Given the symptoms of a patient, identify her illness.
This is a ML problem.
Predict the USD/EUR exchange rate for February 2016.
This is a ML problem.
Compute the mean wage of 10 employees for your company.
This is not a ML problem!
Let’s get down to a bit of coding! Your task is to examine this course’s first prediction model. You’ll be working with the Wage dataset. It contains the wage and some general information for workers in the mid-Atlantic region of the US.
There could be a relationship between a worker’s age and their wage. Older workers tend to have more experience on average than their younger counterparts, hence you could expect an increasing trend in wage as workers age. So we built a linear regression model for you, using lm(): lm_wage. This model predicts the wage of a worker based only on the worker’s age.
With this linear model lm_wage, which is built with data that contain information on workers’ age and their corresponding wage, you can predict the wage of a worker given the age of that worker. For example, suppose you want to predict the wage of a 60 year old worker. You can use the predict() function for this. This generic function takes a model as the first argument. The second argument should be some unseen observations as a data frame. predict() is then able to predict outcomes for these observations.
Note: At this point, the workings of lm() are not important, you’ll get a more comprehensive overview of regression in chapter 4.
First, if you haven’t installed the ISLR package yet, you should with the command: install.packages("ISLR").
Then you call the package, get the data, and check them with the commands:
library(ISLR)
data(Wage)
summary(Wage)
## year age sex maritl
## Min. :2003 Min. :18.00 1. Male :3000 1. Never Married: 648
## 1st Qu.:2004 1st Qu.:33.75 2. Female: 0 2. Married :2074
## Median :2006 Median :42.00 3. Widowed : 19
## Mean :2006 Mean :42.41 4. Divorced : 204
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## race education region
## 1. White:2480 1. < HS Grad :268 2. Middle Atlantic :3000
## 2. Black: 293 2. HS Grad :971 1. New England : 0
## 3. Asian: 190 3. Some College :650 3. East North Central: 0
## 4. Other: 37 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## jobclass health health_ins logwage
## 1. Industrial :1544 1. <=Good : 858 1. Yes:2083 Min. :3.000
## 2. Information:1456 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447
## Median :4.653
## Mean :4.654
## 3rd Qu.:4.857
## Max. :5.763
##
## wage
## Min. : 20.09
## 1st Qu.: 85.38
## Median :104.92
## Mean :111.70
## 3rd Qu.:128.68
## Max. :318.34
##
The model lm_wage models wage by the age variable:
lm_wage <- lm(wage ~ age, data = Wage)
See how the data frame unseen is created with a single column, age, containing a single value, 60:
unseen <- data.frame(age = 60)
To predict(), you have to pass the arguments lm_wage and unseen:
predict(lm_wage, unseen)
## 1
## 124.1413
Based on the linear model that was estimated from the Wage dataset, you predicted the average wage for a 60 year old worker to be around 124 USD a day.
There are three classes of ML problems:
Classification
Regression
Clustering
Each refers to a broad set of topics. In this section we will give a short introduction to each of them; the last three chapters of this course go into more detail on each.
The goal of classification is to predict the category of a new observation based on previous knowledge, i.e., classifying objects based on previous information. The ML algorithm then builds a classifier, i.e., a function that decides the class of unknown observations. Indicative applications of classification include:
Medical Diagnosis. E.g., Sick and Not Sick in a set of patients.
Animal Recognition. E.g., Dog, Cat or Horse in a set of animals.
It is important to keep in mind that
Classification techniques have Qualitative Output, and
the classes to which each observation will belong are known beforehand, i.e., they are Predefined Classes.
A regression problem is a kind of ML problem that tries to predict a continuous or quantitative value for an observation based on previous information. The input variables are called the predictors and the output the response:
In some sense regression is similar to classification, since it tries to predict an outcome based on previous knowledge, but this time we try to predict a value instead of a class for the observation. In the example of predicting the height of a person based on their weight, with regression we can fit, for example, a linear function to the variables for a known set of people in order to achieve our prediction. This linear function can have the form:
height = b0 + b1 * weight
Some typical regression applications include problems such as:
Modeling Credit Scores based on previous Payments.
Finding the trend of magazine Subscriptions over Time.
Estimating the chances of Landing a Job based on college Grades.
All these problems have two things in common.
They have Quantitative Output, and
we will always need knowledge of Previous Input-Output Observations in order to perform a successful regression.
In clustering we try to group objects into clusters, while making sure the clusters themselves are dissimilar from each other. So clustering groups objects which are
Similar within their clusters, but
Dissimilar between different clusters.
In the example of animal classification we have pictures of known animals to feed to our ML algorithm in order to classify unknown animals based on their pictures. In the case of clustering, however, we do not have the classes of the animals depicted; we only have pictures of various animals, which we will cluster together based on their characteristics.
So, in clustering ML
we need No knowledge about the labels,
there is no right or wrong, and
there are various possible clusters, because different clustering can reveal different information.
You can solve quite a few problems using classification, regression or clustering. Which of the following questions can be answered using one of the three types of ML algorithm?
How does the exchange rate depend on the GDP?
This is a regression problem. The dependence of the exchange rate on GDP is typically a time series analysis problem, and the trend line is often found using Regression.
Does a document contain the handwritten letter S?
Typical classification problem. Identifying whether a letter is or isn’t present in a handwritten text is a perfect example of a classification problem. You know in advance there are two possible categories: Either a scribble is the Letter S or it is not!
How can I group supermarket products using purchase frequency?
Clustering is more suitable here, to identify potential categories out of a group of observations, without knowing what the expected categories are in advance.
Filtering spam from relevant emails is a typical machine learning task. Information such as word frequency, character frequency and the amount of capital letters can indicate whether an email is spam or not.
In the following exercise you’ll work with the dataset emails, which is loaded in your workspace (Source: UCI Machine Learning Repository). Here, several emails have been labeled by humans as spam (1) or not spam (0) and the results are found in the column spam. The considered feature in emails to predict whether it was spam or not is avg_capital_seq. It is the average amount of sequential capital letters found in each email.
In the code, you’ll find a crude spam filter already built, spam_classifier(), that uses avg_capital_seq to predict whether an email is spam or not. In the function definition, it’s important to realize that x refers to avg_capital_seq. So where avg_capital_seq is greater than 4, spam_classifier() predicts the email is spam (1); if avg_capital_seq is inclusively between 3 and 4, it predicts not spam (0); and so on. This classifier’s methodology of predicting whether an email is spam or not seems pretty random, but let’s see how it does anyway!
Your job is to inspect the emails dataset, apply spam_classifier to it, and compare the predicted labels with the true labels. If you want to learn more about writing functions, consider taking the Writing Functions in R course taught by Hadley and Charlotte Wickham.
First, we input the dataset.
emails <- read.table("./Data/emails_small.dat", header = TRUE, sep = "")
Check the dimensions of the dataset with dim():
dim(emails)
## [1] 13 2
The definition of spam_classifier():
spam_classifier <- function(x){
prediction <- rep(NA, length(x)) # initialize prediction vector
prediction[x > 4] <- 1
prediction[x >= 3 & x <= 4] <- 0
prediction[x >= 2.2 & x < 3] <- 1
prediction[x >= 1.4 & x < 2.2] <- 0
prediction[x > 1.25 & x < 1.4] <- 1
prediction[x <= 1.25] <- 0
return(prediction) # prediction is either 0 or 1
}
Pass the avg_capital_seq column of emails to spam_classifier() to determine which emails are spam and which aren’t:
spam_pred <- spam_classifier(emails$avg_capital_seq)
spam_pred == emails$spam
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
All of them were correctly classified!
It looks like spam_classifier() correctly filtered the spam 13 out of 13 times! Sadly, this classifier was made to perfectly filter all 13 examples. If you were to use it on a new set of emails, the results would be far less satisfying. In chapter 3, you’ll learn more about techniques to classify the data, but without cheating!
It’s time for you to make another prediction with regression! More precisely, you’ll analyze the number of views of your LinkedIn profile. With your growing network and your data science skills improving daily, you wonder if you can predict how often your profile will be visited in the future based on the number of days it’s been since you created your LinkedIn account.
The instructions will help you predict the number of profile views for the next 3 days, based on the views for the past 3 weeks. The linkedin vector, which contains this information, is made available in your workspace:
linkedin <- c(5, 7, 4, 9, 11, 10, 14, 17, 13, 11, 18, 17, 21, 21, 24, 23, 28, 35, 21, 27, 23)
Create the days vector with the seq() function, or simply the : operator:
days <- rep(NA, length(linkedin))
days <- seq(1, 21, by = 1) # or days <- c(1:21)
Use the lm() function such that linkedin (number of views) is a function of days (number of days since you made your account). As an example, lm(y ~ x) builds a linear model such that y is a function of x, or more colloquially, y is based on x. Assign the resulting linear model to linkedin_lm:
linkedin_lm <- lm(linkedin ~ days)
Predict with predict() and the predefined future_days data frame. Assign the result to linkedin_pred:
future_days <- data.frame(days = 22:24)
linkedin_pred <- predict(linkedin_lm, future_days)
plot(linkedin ~ days, xlim = c(1, 24))
points(22:24, linkedin_pred, col = "green")
Last but not least, there’s clustering. This technique tries to group your objects. It does this without any prior knowledge of what these groups could or should look like. For clustering, the concepts of prior knowledge and unseen observations are less meaningful than for classification and regression.
In this exercise, you’ll group irises into 3 distinct clusters, based on several flower characteristics in the iris dataset. It has already been chopped up into a data frame my_iris and a vector species, as shown in the following code.
my_iris <- iris[-5]
species <- iris$Species
The clustering itself will be done with the kmeans() function. How the algorithm actually works will be explained in the last chapter. For now, just try it out to gain some intuition!
Note: In problems that have a random aspect (like this problem with kmeans()), the set.seed() function will be used to enforce reproducibility. If you fix the seed, the random numbers that are generated (e.g. in kmeans()) are always the same.
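A quick illustration of what set.seed() does, using runif() as an arbitrary random generator:

```r
set.seed(42)
a <- runif(3) # three "random" numbers

set.seed(42)
b <- runif(3) # the same seed reproduces the same numbers

identical(a, b) # TRUE
```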
Use the kmeans() function. The first argument is my_iris; the second argument is 3, as you want to find three clusters in my_iris. Assign the result to a new variable, kmeans_iris:
kmeans_iris <- kmeans(my_iris, 3)
The actual species of the observations is stored in species. Use table() to compare it to the groups that the clustering came up with. These groups can be found in the cluster attribute of kmeans_iris.
table(kmeans_iris$cluster, species)
## species
## setosa versicolor virginica
## 1 0 2 36
## 2 0 48 14
## 3 50 0 0
plot(Petal.Length ~ Petal.Width, data = my_iris, col = kmeans_iris$cluster)
The table() function shows that the groups the clustering came up with largely correspond to the actual species of the different observations.
Now that you’ve tried regression, classification and clustering problems, it’s time to delve a little into the differences between these three techniques.
In Supervised Learning we try to find a function that assigns a value or class (regression or classification) to unseen observations, given a set of labeled observations. These known observations are valuable during the training of the function.
There are other techniques that do not need labeled observations. These are called Unsupervised Learning techniques; clustering is one of them.
It is essential to assess the performance of our predictive models. In supervised learning this is done by comparing the real labels of known observations to those predicted by our model. Of course, in a successful model the predictions should be similar to the real observations.
In unsupervised learning, measuring performance is somewhat more difficult, because there are no real labels to compare to. There are various techniques to assess the quality of our modeling, which will be briefly explained in the next chapter.
In ML, some techniques overlap between supervised and unsupervised learning, as in the case of a dataset with many unlabeled observations and a few that are labeled. In this case we first perform clustering to group all observations which are similar, and then use information about the clusters and about the few labeled observations to assign a class to the unlabeled observations. This provides more labeled observations with which to perform a supervised learning method. These are cases of Semi-supervised Learning.
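A toy sketch of this idea (hypothetical data; kmeans() for the clustering step, then a majority vote of the few labeled points inside each cluster to label the rest):

```r
set.seed(1)
# Hypothetical 1-D data: two loose groups of observations
x <- c(rnorm(10, mean = 0), rnorm(10, mean = 10))
labels <- rep(NA, 20)
labels[c(1, 2, 11, 12)] <- c("A", "A", "B", "B") # only four labeled observations

# Step 1: cluster ALL observations, labeled and unlabeled alike
km <- kmeans(data.frame(x), 2)

# Step 2: give every unlabeled observation the majority label of its cluster
for (k in 1:2) {
  in_k <- km$cluster == k
  majority <- names(which.max(table(labels[in_k]))) # table() drops NAs
  if (length(majority) == 1) labels[in_k & is.na(labels)] <- majority
}
labels
```

After the loop, all 20 observations carry a label, even though only 4 were labeled to begin with.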
Previously, you used kmeans() to perform clustering on the iris dataset. Remember that you created your own copy of the dataset, and dropped the Species attribute? That’s right, you removed the labels of the observations.
In this exercise, you will use the same dataset. But instead of dropping the Species labels, you will use them to do some supervised learning using recursive partitioning! Don’t worry if you don’t know what that is yet. Recursive partitioning (a.k.a. decision trees) will be explained in Chapter 3.
Explore the dataset with str() and summary():
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.054 Mean :3.759 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
set.seed(1)
Use the rpart() function from the rpart package. This model trains a decision tree on the iris dataset. Don’t forget to first install the package (install.packages("rpart")) if you haven’t done so.
library(rpart)
tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data = iris, method = "class")
unseen <- data.frame(Sepal.Length = c(5.3, 7.2),
Sepal.Width = c(2.9, 3.9),
Petal.Length = c(1.7, 5.4),
Petal.Width = c(0.8, 2.3))
Use the predict() function with the tree model as the first argument. The second argument should be a data frame containing observations of which you want to predict the label. In this case, you can use the predefined unseen data frame. The third argument should be type = "class". Simply print out the result of this prediction step:
predict(tree, unseen, type = "class")
## 1 2
## setosa virginica
## Levels: setosa versicolor virginica
In this exercise, you will group cars based on their horsepower and their weight. You can find the types of car and corresponding attributes in the cars data frame, which has been derived from the mtcars dataset. It’s available in your workspace with the command:
mtcars <- read.csv("./Data/mtcars.csv", header=TRUE, sep = ",")
Create a subset of the original dataset, and get the row names from the original set.
cars <- subset(mtcars, select = c(wt, hp))
rownames(cars) <- mtcars$X
An equivalent procedure for selecting the cars dataset:
library(dplyr)
cars <- select(mtcars, wt, hp)
rownames(cars) <- mtcars$X
To cluster the different observations, you will once again use kmeans(). In short, your job is to cluster the cars in 2 groups, but don’t forget to explore the dataset first!
Explore the dataset with str() and summary():
str(cars)
## 'data.frame': 32 obs. of 2 variables:
## $ wt: num 2.62 2.88 2.32 3.21 3.44 ...
## $ hp: int 110 110 93 110 175 105 245 62 95 123 ...
summary(cars)
## wt hp
## Min. :1.513 Min. : 52.0
## 1st Qu.:2.581 1st Qu.: 96.5
## Median :3.325 Median :123.0
## Mean :3.217 Mean :146.7
## 3rd Qu.:3.610 3rd Qu.:180.0
## Max. :5.424 Max. :335.0
set.seed(1)
Use kmeans() with two arguments to group the cars into two clusters based on the contents of the cars data frame. Assign the result to km_cars:
km_cars <- kmeans(cars, 2)
km_cars$cluster
## Mazda RX4 Mazda RX4 Wag Datsun 710
## 1 1 1
## Hornet 4 Drive Hornet Sportabout Valiant
## 1 2 1
## Duster 360 Merc 240D Merc 230
## 2 1 1
## Merc 280 Merc 280C Merc 450SE
## 1 1 2
## Merc 450SL Merc 450SLC Cadillac Fleetwood
## 2 2 2
## Lincoln Continental Chrysler Imperial Fiat 128
## 2 2 1
## Honda Civic Toyota Corolla Toyota Corona
## 1 1 1
## Dodge Challenger AMC Javelin Camaro Z28
## 1 1 2
## Pontiac Firebird Fiat X1-9 Porsche 914-2
## 2 1 1
## Lotus Europa Ford Pantera L Ferrari Dino
## 1 2 2
## Maserati Bora Volvo 142E
## 2 1
In the previous exercise, you grouped the cars based on their horsepower and their weight. Now let’s have a look at the outcome!
An important part in machine learning is understanding your results. In the case of clustering, visualization is key to interpretation! One way to achieve this is by plotting the features of the cars and coloring the points based on their corresponding cluster.
In this exercise you’ll summarize your results in a comprehensive figure. The dataset cars is already available in your workspace; the code to perform the clustering is already available.
Print out the centroids of km_cars:
km_cars$centers
## wt hp
## 1 2.692000 99.47368
## 2 3.984923 215.69231
Call the plot() command and color the cars based on their cluster.
Do this by setting the col argument to the cluster partitioning vector: km_cars$cluster.
Add the clusters’ centroids to your earlier plot with points().
To learn about the other parameters that have been defined for you, have a look at the graphical parameters documentation.
plot(cars, col = km_cars$cluster)
# Alternative (more general) call:
# plot(hp ~ wt, data = cars, col = km_cars$cluster)
points(km_cars$centers, pch = 22, bg = c(1, 2), cex = 2)
In this chapter we’ll learn how to assess the performance of both supervised and unsupervised learning algorithms. We will also discuss why and how we should split our data into a training set and a test set. Finally, the concepts of bias and variance are explained.
The assessment of the quality of our predictive ML model depends on the problem and method used. First we need to define the performance of our model, which depends on the context in which we want to use the model. Three of the performance parameters – their importance being defined by the task at hand – are listed below:
The Accuracy of the model is sometimes the most important measure of performance.
The Computation time is also a good indicator of performance.
The Interpretability of our model is also essential in various cases.
Each of the three types of ML tasks (classification, regression, clustering) has its own performance measures.
In Classification systems, Accuracy and Error are the essential performance measures. Basically, they reflect the number of times the system is right or wrong; if accuracy goes up, the error rate goes down. Typically, accuracy is defined as
accuracy = (correctly classified instances) / (total instances), with error = 1 - accuracy.
That is why we also always need to calculate the Confusion Matrix in every classification problem. Its rows and columns contain all available labels, and each cell contains the frequency of instances that are classified in a certain way. Take the example of binary classification with one class positive (p) and the other negative (n), i.e., each observation is labeled either p or n.
The upper-left corner contains the frequency of true positives, i.e., instances which are correctly classified as positive. The lower-right corner contains the frequency of true negatives, i.e., instances which are correctly classified as negative. The upper-right corner contains the frequency of false negatives, i.e., instances which are falsely classified as negative. Finally, the lower-left corner contains the frequency of false positives, i.e., instances which are wrongly classified as positive.
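In R, such a confusion matrix can be produced with table(); a tiny made-up example with positive (p) and negative (n) labels:

```r
# Hypothetical true labels and classifier predictions
truth <- factor(c("p", "p", "n", "n", "p", "n"), levels = c("p", "n"))
pred  <- factor(c("p", "n", "n", "p", "p", "n"), levels = c("p", "n"))

# Rows: true labels; columns: predicted labels
# Upper-left: true positives (2); lower-right: true negatives (2)
table(truth, pred)
```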
From the counts in the confusion matrix, the measures of accuracy can be calculated. These measures are Accuracy, Precision, and Recall. They are defined as:
Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
The performance of regression is assessed through the Root Mean Square Error (RMSE). It is calculated with the following formula:
RMSE = sqrt( (1/N) * sum( (y_i - ŷ_i)^2 ) )
where
y_i is the actual outcome of observation i,
ŷ_i is the predicted outcome of observation i, and
N is the total number of observations.
It is essentially the mean distance between the measured actual values y_i and the values ŷ_i predicted by the model. If the distances between y_i and ŷ_i are large the RMSE will be large, while if these distances are small, the RMSE will also be small.
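As a quick sketch in R (toy numbers, not from any dataset in this course):

```r
# Toy actual and predicted values
y     <- c(3.1, 4.0, 5.2, 6.1)
y_hat <- c(3.0, 4.2, 5.0, 6.5)

# RMSE: square root of the mean squared difference
rmse <- sqrt(mean((y - y_hat)^2))
rmse # 0.25
```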
In the case of Clustering ML methods, where we have no label information, we have to rely on distance metrics between points inside and outside the determined clusters. The performance measure in this case always comprises two elements:
The Similarity within each cluster: how alike the points in the same cluster are. For a good model we want this metric to be high (very similar points within each cluster).
The Similarity between the clusters: how similar the points in two different clusters are. For high performance of our clustering technique we want this metric to be low (very dissimilar clusters from one another).
Some of the measures for both metrics are shortly summarized here:
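One common pair of such measures is the within-cluster sum of squares (WSS, reflecting within-cluster similarity) and the between-cluster sum of squares (BSS, reflecting between-cluster dissimilarity). kmeans() in R returns both; a sketch on the iris measurements:

```r
set.seed(1)
km <- kmeans(iris[-5], 3)

km$tot.withinss         # total WSS: lower means more similar points within each cluster
km$betweenss            # BSS: higher means more dissimilar clusters
km$betweenss / km$totss # ratio close to 1 indicates a good clustering
```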
Have you ever wondered if you would have survived the Titanic disaster in 1912? Our friends from Kaggle have some historical data on this event. The titanic dataset is already available in your workspace.
In this exercise, a decision tree is learned on this dataset. The tree aims to predict whether a person would have survived the accident based on the variables Age, Sex and Pclass (travel class).
The decision the tree makes can be deemed correct or incorrect if we know what the person’s true outcome was. That is, if it’s a supervised learning problem.
Since the true fate of the passengers, Survived, is also provided in titanic, you can compare it to the prediction made by the tree. The results can be summarized in a confusion matrix. In R, you can use the table() function for this.
In this exercise, you will only focus on assessing the performance of the decision tree. In chapter 3, you will learn how to actually build a decision tree yourself.
Note: As in the previous chapter, there are functions that have a random aspect. The set.seed() function is used to enforce reproducibility. Don’t worry about it, just don’t remove it!
set.seed(1)
titanic_kaggle <- read.csv("./Data/kaggle_titanic.csv", header=TRUE)
titanic <- subset(titanic_kaggle, !is.na(Age), select = c(Survived, Pclass, Sex, Age))
str(titanic)
## 'data.frame': 714 obs. of 4 variables:
## $ Survived: int 0 1 1 1 0 0 0 1 1 1 ...
## $ Pclass : int 3 1 3 1 3 1 3 3 2 3 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
## $ Age : num 22 38 26 35 35 54 2 27 14 4 ...
tree <- rpart(Survived ~., data = titanic, method = "class")
Use predict() to predict who survived in the titanic dataset. Use tree as the first argument and titanic as the second argument. Make sure to set the type parameter to "class". Assign the result to pred:
pred <- predict(tree, titanic, type = "class")
Build the confusion matrix with the table() function. This function builds a contingency table. The first argument corresponds to the rows in the matrix and should be the Survived column of titanic: the true labels from the data. The second argument, corresponding to the columns, should be pred: the tree’s predicted labels:
table(titanic$Survived, pred)
## pred
## 0 1
## 0 371 53
## 1 78 212
So to summarize, 212 out of the 290 survivors were correctly predicted to have survived. On the other hand, 371 out of the 424 deceased were correctly predicted to have perished.
The confusion matrix from the last exercise provides you with the raw performance of the decision tree:
The survivors correctly predicted to have survived: true positives (TP)
The deceased who were wrongly predicted to have survived: false positives (FP)
The survivors who were wrongly predicted to have perished: false negatives (FN)
The deceased who were correctly predicted to have perished: true negatives (TN)
The confusion matrix is called conf; try this in the console to see its specific values:
conf <- table(titanic$Survived, pred)
These values can be used to calculate ratios that assess the performance of a classification algorithm. An example is the accuracy, which in this case represents the percentage of correctly predicted fates of the passengers.
Apart from accuracy, precision and recall are also key metrics to assess the results of a classification algorithm:
The confusion matrix you’ve calculated in the previous exercise is still available in your workspace as conf.
TN <- conf[1, 1] # 371
FP <- conf[1, 2] # 53
FN <- conf[2, 1] # 78
TP <- conf[2, 2] # 212
Calculate the accuracy acc and print it out.
acc <- (TP + TN)/(TP + FP + TN + FN)
acc
## [1] 0.8165266
Calculate the precision prec and the recall rec. Print out both of them.
prec <- TP/(TP + FP)
prec
## [1] 0.8
rec <- TP/(TP + FN)
rec
## [1] 0.7310345
With an accuracy of around 82%, the model only incorrectly predicted 18% of the passengers’ fates. Overall you have an acceptable, but not truly satisfying classifier. In chapter 3, you’ll improve your results!
Imagine this: you’re working at NASA and your team measured the sound pressure produced by an airplane’s wing under different settings. These settings are the frequency of the wind, the angle of the wing, and several more. The results of this experiment are listed in the air dataset (Source: UCIMLR).
air <- read.table("./Data/airfoil_self_noise.dat", header=TRUE)
Your team wants to build a model that’s able to predict the sound pressure based on these settings, instead of having to do those tedious experiments every time.
A colleague has prepared a multivariable linear regression model, fit. It takes as input the predictors: wind frequency (freq), wing’s angle (angle), and chord’s length (ch_length). The response is the sound pressure (dec). All these variables can be found in air.
Now, your job is to assess the quality of your colleague’s model by calculating the RMSE:

RMSE = sqrt( (1/n) * Σ (yᵢ − ŷᵢ)² )

where yᵢ is the true outcome of observation i, ŷᵢ its predicted outcome, and n is the number of observations.
For example: if truth$col was a column with true values of a variable and pred is the prediction of that variable, the formula could be calculated in R as follows:
sqrt((1/nrow(truth)) * sum( (truth$col - pred) ^ 2))
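As a quick sanity check of this formula, here is a toy example with made-up numbers (truth and pred below are hypothetical, just for illustration):

```r
# Hypothetical true values and predictions, purely for checking the formula
truth <- data.frame(col = c(1, 2, 3))
pred <- c(1, 2, 5)

# The squared errors are 0, 0 and 4, so RMSE = sqrt((0 + 0 + 4)/3) ≈ 1.155
rmse_toy <- sqrt((1/nrow(truth)) * sum((truth$col - pred) ^ 2))
rmse_toy
```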
str(air)
## 'data.frame': 1503 obs. of 6 variables:
## $ freq : int 800 1000 1250 1600 2000 2500 3150 4000 5000 6300 ...
## $ angle : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ch_length: num 0.305 0.305 0.305 0.305 0.305 ...
## $ velocity : num 71.3 71.3 71.3 71.3 71.3 71.3 71.3 71.3 71.3 71.3 ...
## $ thickness: num 0.00266 0.00266 0.00266 0.00266 0.00266 ...
## $ dec : num 126 125 126 128 127 ...
fit <- lm(dec ~ freq + angle + ch_length, data = air)
Use the predict() function to make predictions for the observations in the air dataset. Simply pass fit to predict(); R will know what to do. Assign the result to pred.
pred <- predict(fit, air)
Calculate the RMSE between air$dec and pred, and assign it to rmse. Print out rmse.
rmse <- sqrt( (1/length(air$freq)) * sum((air$dec - pred)^2) )
rmse
## [1] 5.215778
So the RMSE is given by 5… 5 what? Well 5 decibels, the unit of the sound pressure, your response variable. As a standalone number, it doesn’t tell you a whole bunch. In order to derive its meaning, it should be compared to the RMSE of a different model for the same problem, which is exactly what you’ll do in the next exercise!
In the last exercise, your team’s model had 3 predictors (input variables), but what if you included more? The measurements of the free-stream velocity (velocity) and the suction side displacement thickness (thickness) are available in the air dataset as well!
Adding the new variables will definitely increase the complexity of your model, but will it increase the performance? To find out, we’ll take the RMSE from the new, more complex model and compare it to that of the original model.
A colleague took your code from the previous exercise and added code that builds a new extended model, fit2! It’s your job to once again assess the performance by calculating the RMSE.
fit <- lm(dec ~ freq + angle + ch_length, data = air)
pred <- predict(fit)
rmse <- sqrt(sum( (air$dec - pred) ^ 2) / nrow(air))
rmse
## [1] 5.215778
fit2 <- lm(dec ~ freq + angle + ch_length + velocity + thickness, data = air)
Use the predict() function to make predictions using fit2, for all values in the air dataset. Assign the resulting vector to pred2. Then calculate the corresponding RMSE, rmse2, print it out, and compare it with the earlier rmse. What do you conclude?
pred2 <- predict(fit2)
rmse2 <- sqrt(sum( (air$dec - pred2) ^ 2) / nrow(air))
rmse2
## [1] 4.799244
Adding complexity seems to have caused the RMSE to decrease, from 5.216 to 4.799. But there’s more going on here; perhaps adding more variables to a regression always leads to a decrease of your RMSE? You’ll learn more about this in Chapter 4!
In the dataset seeds you can find various metrics such as area, perimeter and compactness for 210 seeds. (Source: UCIMLR).
seeds <- read.table("./Data/seeds.dat", header=TRUE)
However, the seeds’ labels were lost. Hence, we don’t know which metrics belong to which type of seed. What we do know, is that there were three types of seeds.
set.seed(1)
The following code groups the seeds into three clusters (km_seeds), but is it likely that these three clusters represent our seed types? Let’s find out.
km_seeds <- kmeans(seeds[1:7], 3)
There are two initial steps you could take:
Visualize the distribution of cluster assignments among two variables, for example length and compactness.
Verify if the clusters are well separated and compact. To do this, you can calculate the between and within cluster sum of squares respectively.
str(seeds)
## 'data.frame': 210 obs. of 7 variables:
## $ area : num 15.3 14.9 14.3 13.8 16.1 ...
## $ perimeter : num 14.8 14.6 14.1 13.9 15 ...
## $ compactness : num 0.871 0.881 0.905 0.895 0.903 ...
## $ length : num 5.76 5.55 5.29 5.32 5.66 ...
## $ width : num 3.31 3.33 3.34 3.38 3.56 ...
## $ asymmetry : num 2.22 1.02 2.7 2.26 1.35 ...
## $ groove_length: num 5.22 4.96 4.83 4.8 5.17 ...
Color the observations of the plot() command based on their cluster, by setting the col argument equal to the cluster element of km_seeds.
plot(seeds$length ~ seeds$compactness, col=km_seeds$cluster)
# alternative call:
# plot(length ~ compactness, data=seeds, col=km_seeds$cluster)
The within and between cluster sums of squares are stored in km_seeds as tot.withinss and betweenss. Is the within sum of squares substantially lower than the between sum of squares?
km_seeds$tot.withinss/km_seeds$betweenss
## [1] 0.2762846
The within sum of squares is far lower than the between sum of squares, indicating that the clusters are well separated and overall compact. This is further strengthened by the plot you made, where the clusters were visually distinct for these two variables. It’s likely that these three clusters represent the three seed types well, even if there’s no way to truly verify this.
Here we will discuss the differences between Machine (supervised) learning and classical statistics, which boils down to Predictive power vs. Descriptive power.
Supervised Learning should have strong predictive power, i.e., our model must be able to predict the behavior of unseen observations. Classical Statistics, on the other hand, focuses on how well our model fits the data: the model must be able to describe correctly the known observations.
When learning a Predictive Model, we should not use the complete data sample to train it. We only use the so-called Training set for training the model. The remaining part of the original dataset is called the Test set, and this will be used to evaluate the actual performance of the model.
It is essential that the Training and Test sets are disjoint, i.e., that they share no observations. The test set can then be used to check whether the model correctly predicts unseen observations, i.e., whether it generalizes correctly.
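A minimal sketch of building such a disjoint split with index vectors (the dataset size here is made up; the actual titanic split is shown later in this chapter):

```r
set.seed(1)
n <- 100                                 # pretend dataset of 100 observations
train_idx <- sample(n, round(0.75 * n))  # 75% of the indices, drawn at random
test_idx <- setdiff(1:n, train_idx)      # the remaining 25%

# The two index sets share no observations and together cover the data
length(intersect(train_idx, test_idx))   # 0
length(train_idx) + length(test_idx)     # 100
```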
Let us assume that we have a Classification Problem. The data of the problem consist of:
the instances (observations)
their features
the class labels of the observations
The data frame is shown here:
For our training set we will use the first part of the observations (colored blue in the graph above), and the remaining ones will be used as the test set (green observations).
For the training set we use the features and the class labels to train a model. For the test set we feed the features to our model but (of course) not the labels: these will be predicted by the model. Their comparison with the actual labels fills the confusion matrix and helps us assess the accuracy of our model.
The concept of these two sets is only important for supervised learning. For Unsupervised Learning (i.e., clustering) there is no need to split the data into these two sets, because our data is not labeled.
The following diagram gives a workflow of how to assess the predictive power of a model.
We can use the performance measures discussed in the previous sections and apply them to the test set. It is important to understand that the measurement of the predictive power of a model must not be influenced in any way by the instances in the training set. So, it is important to decide how to split the sets.
Knowing which instances go to the training set and which to the test set is essential. There are basic rules to follow for this decision. The Training set must be chosen to be larger than the Test set; a typical ratio is training/test = 3/1. This ratio is arbitrary and depends on the problem at hand. In general, though, the more data we use to train the model, the better the model will be. At the same time, the test set must not be too small.
The Distribution of the Sets is also very important. So we must choose wisely which Elements go to the training and which to the Test Set.
For Classification, the classes inside the training and test sets should have similar distributions. We should avoid a class that is not represented in one of the sets.
For Classification & Regression, it is always good practice to shuffle the complete dataset before splitting it.
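A quick sketch of both points: shuffle before splitting, then compare the class distributions of the two sets (the dataset below is made up for illustration):

```r
set.seed(1)
# Hypothetical labeled dataset with an imbalanced class column
df <- data.frame(label = factor(rep(c("yes", "no"), times = c(70, 30))))

# Shuffle the complete dataset before splitting it
shuffled <- df[sample(nrow(df)), , drop = FALSE]
train <- shuffled[1:70, , drop = FALSE]
test <- shuffled[71:100, , drop = FALSE]

# Both sets should show roughly the same class proportions
prop.table(table(train$label))
prop.table(table(test$label))
```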
Also, the Effect of Sampling plays an important role: sampling the data in a certain way to compose the test set can influence the performance measures on that set.
We add Robustness to our performance measures with the use of Cross-Validation. Cross-Validation means that we use a learning algorithm to train the model multiple times, each time with a different separation between training and test sets. In n-fold cross-validation, we “fold” the test set over the dataset n times; each test set is 1/n the size of the total dataset.
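A rough sketch of how the fold indices can be constructed, here for 5-fold cross-validation on a hypothetical set of 20 observations (the full loop for the titanic data appears later in this chapter):

```r
n_obs <- 20    # hypothetical number of observations
n_fold <- 5    # each test set is 1/5 of the data
fold_size <- n_obs / n_fold

for (i in 1:n_fold) {
  test_idx <- (((i - 1) * fold_size) + 1):(i * fold_size)  # fold i is the test set
  train_idx <- setdiff(1:n_obs, test_idx)                  # the rest is the training set
  # train on train_idx, evaluate on test_idx, store the performance measure
}
```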
Let’s return to the titanic dataset for which we set up a decision tree. In exercises 2 and 3 you calculated a confusion matrix to assess the tree’s performance. However, the tree was built using the entire set of observations. Therefore, the confusion matrix doesn’t assess the predictive power of the tree. The training set and the test set were one and the same thing: this can be improved!
First, you’ll want to split the dataset into train and test sets. You’ll notice that the titanic dataset is sorted on titanic$Survived, so you’ll need to first shuffle the dataset in order to have a fair distribution of the output variable in each set.
For example, you could use the following commands to shuffle a data frame df and divide it into training and test sets with a 60/40 split between the two.
n <- nrow(df)
shuffled_df <- df[sample(n), ]
train_indices <- 1:round(0.6 * n)
train <- shuffled_df[train_indices, ]
test_indices <- (round(0.6 * n) + 1):n
test <- shuffled_df[test_indices, ]
Watch out, this is an example of how to do a 60/40 split! In the exercise you have to do a 70/30 split. However, you can use the same commands, just change the numbers!
The titanic dataset is loaded into the workspace
titanic_datacamp <- read.table("./Data/datacamp_titanic.dat", header=TRUE)
titanic_datacamp$Survived <- factor(titanic_datacamp$Survived, levels = c("1", "0"))
titanic_datacamp$Pclass <- factor(titanic_datacamp$Pclass, levels = c("1", "2", "3"))
titanic_datacamp$Sex <- factor(titanic_datacamp$Sex, levels = c("female", "male"))
titanic <- titanic_datacamp[with(titanic_datacamp, order(Survived)), ]
n <- nrow(titanic)
shuffled <- titanic[sample(n),]
The training set indices are in 1:round(0.7 * n) and the test set indices in (round(0.7 * n) + 1):n. The example in the exercise description above can help you!
train_indices <- 1:round(0.7 * n)
test_indices <- (round(0.7 * n)+1):n
train <- shuffled[train_indices,]
test <- shuffled[test_indices,]
str(train)
## 'data.frame': 500 obs. of 4 variables:
## $ Survived: Factor w/ 2 levels "1","0": 2 1 2 2 2 2 1 1 1 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 2 3 1 3 3 2 3 1 2 3 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 1 1 1 2 2 1 1 ...
## $ Age : num 18 5 36 18 45 26 9 49 28 31 ...
str(test)
## 'data.frame': 214 obs. of 4 variables:
## $ Survived: Factor w/ 2 levels "1","0": 2 2 2 2 2 2 1 2 2 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 2 1 3 3 3 3 1 3 3 3 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 2 2 1 2 1 1 2 2 1 ...
## $ Age : num 54 54 19 37 26 29 22 28 43 26 ...
You’ve successfully split the entire dataset into a training set and a test set. Now you are ready to build a model on the training set, and to test its predictive ability on a test set.
Time to redo the model training from before. The titanic data frame is again available in your workspace. This time, however, you’ll want to build a decision tree on the training set, and next assess its predictive power on a set that has not been used for training: the test set.
The code that splits titanic up in train and test has already been included. Also, the old code that builds a decision tree on the entire set is included. Up to you to correct it and connect the dots to get a good estimate of the model’s predictive ability.
set.seed(1)
n <- nrow(titanic)
shuffled <- titanic[sample(n),]
train <- shuffled[1:round(0.7 * n),]
test <- shuffled[(round(0.7 * n) + 1):n,]
Change the rpart(...) call so that the tree is learned on the training set.
tree <- rpart(Survived ~ Pclass + Sex + Age, data = train, method = "class")
tree <- rpart(Survived ~ ., data = train, method = "class")
Use the predict() function with the tree model as the first argument and the correct dataset as the second argument. Set type to "class". Call the predicted vector pred. Remember that you should do the predictions on the test set.
pred <- predict(tree, test, type = "class")
Use the table() function to calculate the confusion matrix. Assign this table to conf. Construct the table with the test set’s actual values (test$Survived) as the rows and the model’s predicted values (pred) as the columns. Finally, print out conf.
conf <- table(test$Survived, pred)
conf
## pred
## 1 0
## 1 60 29
## 0 22 103
The confusion matrix reveals an accuracy of (60+103)/(60+29+22+103) = 76.17%. This is less than the 81.65% you calculated in the first section of this chapter. However, this is a much more trustworthy estimate of the model’s true predictive power.
You already did a great job in assessing the predictive performance, but let’s take it a step further: cross validation.
In this exercise, you will fold the dataset 6 times and calculate the accuracy for each fold. The mean of these accuracies forms a more robust estimation of the model’s true accuracy of predicting unseen data, because it is less dependent on the choice of training and test sets.
Note: Other performance measures, such as recall or precision, could also be used here.
The code to split the dataset correctly 6 times and build a model each time on the training set is written inside the for loop. accs is initialized as well.
Use the model to predict the values of the test set. Use predict() with three arguments: the decision tree (tree), the test set (test) and don’t forget to set type to "class". Assign the result to pred.
Make the confusion matrix using table() and assign it to conf. test$Survived should be on the rows, pred on the columns.
Fill in the statement to define accs[i]. The result should be the accuracy: the sum of the diagonal of the confusion matrix divided by the total sum of the confusion matrix.
Set random seed.
set.seed(1)
Initialize the accs vector.
accs <- rep(0, 6)
for (i in 1:6) {
# These indices indicate the interval of the test set
indices <- (((i-1) * round((1/6)*nrow(shuffled))) + 1):((i*round((1/6) * nrow(shuffled))))
# Exclude them from the train set
train <- shuffled[-indices,]
# Include them in the test set
test <- shuffled[indices,]
# A model is learned using each training set
tree <- rpart(Survived ~ ., train, method = "class")
# Make a prediction on the test set using tree
pred <- predict(tree, test, type="class")
# Assign the confusion matrix to conf
conf <- table(test$Survived,pred)
# Assign the accuracy of this model to the ith index in accs
accs[i] <- sum(diag(conf))/sum(conf)
}
mean(accs)
## [1] 0.7857143
This estimate will be a more robust measure of your accuracy. It will be less susceptible to the randomness of splitting the dataset.
In this section we discuss how splitting the data into train/test sets affects the accuracy of our models, and the concepts of Over- and Under-fitting our data. We first introduce two essential elements in ML:
Bias, and
Variance.
These concepts reveal the key challenge of ML, so it is important that we understand them. Prediction is the main goal of supervised learning techniques. Basically, the prediction error of a model can be split into two components: the reducible and the irreducible error.
The Irreducible Error can be seen as Noise, i.e., irregularities in the data. It cannot be reduced by the model, so it is not what we try to minimize when training.
The Reducible Error is the most interesting to ML. It is the error due to an unfit model. This is the error we seek to minimize (the lower the error, the better the model). It can be split into Error due to Bias and Error due to Variance.
The error due to bias is the error we make because of the assumptions we take when we specify the learning algorithm. It is the difference between the predictions made by models using this particular algorithm and the true values.
In the example below, suppose we want to model the dataset plotted with red dots, which was generated by a quadratic function (green line), and we assume that a linear model fits best, so we apply a linear regression method. The derived model, plotted with the blue line, obviously does not fit the data well. Our error due to bias is high here, because we restrict our model: it is not able to fit the more complex quadratic shape of our data.
The error due to variance is the prediction error caused by the sampling of the dataset (the training set). A model with high variance fits the training set very closely.
In the same example of the quadratic data as before, we now assume a polynomial model that fits the dataset perfectly, as shown with the purple line in the graph below. This algorithm has a very low bias, with few restrictions. However, it has very high variance: between the fitting points the fitted polynomial is all over the place, so if we slightly change the values of the training set, the fit changes completely, and so will the predictions in a subsequent testing stage.
It is clear now that low bias, i.e., not restricting your learning model and allowing it to fit perfectly, leads to high variance. On the other hand, low variance, i.e., a model less sensitive to the training set, leads to high bias, i.e., a more restrictive model. This is called the Bias-Variance Tradeoff.
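A toy illustration of one half of this tradeoff (the data are simulated here, not part of the course datasets): a more flexible nested model can only lower the error on the training set itself, regardless of how it would predict new data.

```r
set.seed(1)
x <- runif(50, -2, 2)
y <- x^2 + rnorm(50, sd = 0.3)       # quadratic truth plus noise
d <- data.frame(x = x, y = y)

# RMSE on the training data itself
rmse_train <- function(fit) sqrt(mean(residuals(fit)^2))

fit_line <- lm(y ~ x, data = d)           # high bias: a straight line
fit_poly <- lm(y ~ poly(x, 9), data = d)  # low bias, high variance

rmse_train(fit_line)   # larger
rmse_train(fit_poly)   # smaller on the training data, by construction
```

Because the straight line is nested inside the degree-9 polynomial, the flexible fit is guaranteed to have a training RMSE at most as large; whether it predicts unseen data better is a different question entirely.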
Concerning the concept of Overfitting, we should bear in mind that, because of the split into train/test sets, the accuracy of the model depends heavily on which data end up in the training set. Basically, with high variance we fit the training set “too well”. This is called overfitting, and we can say that the model is too specific. On the other hand, Underfitting means that we restrict the model too much: the bias of the model is too high. We can say that the model is too general.
As an example, let us consider the case where we need to decide whether an e-mail is spam or not. We have a training set with many e-mails and we extract two features out of each e-mail: (1) the number of capital letters and (2) the number of exclamation marks. We suppose that e-mails with a lot of capital letters or exclamation marks are spam, while the others are not. However, there is an exception: an e-mail with 50 capital letters and 30 exclamation marks which is not spam. If, when training a model on these data, we turn this exceptional case into a condition for accepting an e-mail as non-spam, we overfit the data and build a model that is too specific. An underfit model could be one that simply marks all e-mails with 10 or more capital letters as spam and the others as non-spam: this model is too general.
In Chapter 1 we designed a crude spam filter, called spam_classifier(). It filters spam based on the average sequential use of capital letters (avg_capital_seq) to decide whether an email was spam (1) or not (0).
You may recall that we cheated and it perfectly filtered the spam. However, the set (emails_small) you used to test your classifier was only a small fraction of the entire dataset emails_full (Source: UCIMLR).
emails_full <- read.table("./Data/emails_full.dat", header = T)
emails_small <- read.table("./Data/emails_small.dat", header = T)
# Determining the spam variable as of type "factor" with levels 0 and 1.
emails_small$spam <- factor(emails_small$spam, levels = c("1", "0"))
emails_full$spam <- factor(emails_full$spam, levels = c("1", "0"))
Your job is to verify whether the spam_classifier() that was built for you generalizes to the entire set of emails. The accuracy for the set emails_small was equal to 1. Is the accuracy for the entire set emails_full substantially lower?
spam_classifier <- function(x){
prediction <- rep(NA, length(x)) # initialize prediction vector
prediction[x > 4] <- 1
prediction[x >= 3 & x <= 4] <- 0
prediction[x >= 2.2 & x < 3] <- 1
prediction[x >= 1.4 & x < 2.2] <- 0
prediction[x > 1.25 & x < 1.4] <- 1
prediction[x <= 1.25] <- 0
return(factor(prediction, levels = c("1", "0"))) # prediction is either 0 or 1
}
Apply spam_classifier() to the avg_capital_seq variable in emails_full and save the results in pred_full.
pred_full <- spam_classifier(emails_full$avg_capital_seq)
Create a confusion matrix, conf_full, with table(). Put the true labels found in emails_full$spam in the rows.
conf_full <- table(emails_full$spam, pred_full)
Use conf_full to calculate the accuracy, acc_full. The functions diag() and sum() will help. Print out the result.
acc_full <- sum(diag(conf_full))/sum(conf_full)
acc_full
## [1] 0.6561617
This hard-coded classifier gave you an accuracy of around 65% on the full dataset, which is way worse than the 100% you had on the small dataset back in chapter 1. Hence, the classifier does not generalize well at all!
Based on the above exercise, the spam_classifier() of Chapter 1 is bogus. It simply overfits on the emails_small set and, as a result, doesn’t generalize to larger datasets such as emails_full.
So let’s try something else. On average, emails with a high frequency of sequential capital letters are spam. Let us try to simply filter spam based on one threshold for avg_capital_seq.
For example, we could filter all emails with an avg_capital_seq greater than 4 as spam. By doing this, we increase the interpretability of the classifier and restrict its complexity. However, this increases the bias, i.e., the error due to restricting your model.
Our job is to simplify the rules of spam_classifier and calculate the accuracy for the full set emails_full. Next, compare it to that of the small set emails_small, which is coded for you. Does the model generalize now?
Simplify the rules of the spam_classifier.
Emails with an avg_capital_seq strictly greater than 4 are spam (labeled with 1); all others are seen as no spam (0).
The all-knowing classifier that has already been “learned”:
spam_classifier <- function(x){
prediction <- rep(NA, length(x))
prediction[x > 4] <- 1
prediction[x >= 3 & x <= 4] <- 0
prediction[x >= 2.2 & x < 3] <- 1
prediction[x >= 1.4 & x < 2.2] <- 0
prediction[x > 1.25 & x < 1.4] <- 1
prediction[x <= 1.25] <- 0
return(factor(prediction, levels = c("1", "0")))
}
spam_classifier <- function(x){
prediction <- rep(NA, length(x))
prediction[x > 4] <- 1
prediction[x <= 4] <- 0
return(factor(prediction, levels = c("1", "0")))
}
The small-set confusion matrix and accuracy, conf_small and acc_small, have been calculated for you like this:
conf_small <- table(emails_small$spam, spam_classifier(emails_small$avg_capital_seq))
acc_small <- sum(diag(conf_small)) / sum(conf_small)
acc_small
## [1] 0.7692308
Build a confusion matrix with emails_full$spam in the rows and the predicted spam values in the columns. Assign it to conf_full.
conf_full <- table(emails_full$spam, spam_classifier(emails_full$avg_capital_seq))
Use conf_full to calculate the accuracy. Assign this value to acc_full and print it out.
acc_full <- sum(diag(conf_full)) / sum(conf_full)
acc_full
## [1] 0.7259291
Before, acc_small and acc_full were 100% and 65%, respectively; what do you conclude?
Indeed! Your model no longer fits the small dataset perfectly, but it fits the big dataset better. You increased the bias of the model and caused it to generalize better over the complete dataset. Still, while the first classifier overfit the data, an accuracy of 73% is far from satisfying for a spam filter. Maybe you can provide us with a better model in a later exercise!
Which of the following models do you think would have the highest interpretability?
Remember that a high interpretability usually implies a high bias.
It can help to try to describe what the model has to do to classify instances.
Usually if you are able to describe what all aspects of a model do and what they mean, the model is quite interpretable.
We used a high-degree polynomial. That answer is not correct. Do you think you could explain why the polynomial behaves the way it does? Wouldn’t it be easier if you just had to explain the slope of a linear model, for example?
It uses one threshold on the height attribute to make its prediction.
True! This is a very simple and general classification.
Note that this would not necessarily be a good model.
It uses attributes such as specific word count, character count, and so on.
Incorrect! Allowing a learning algorithm to use a lot of attributes will decrease the bias on the model. This will generally increase the complexity.
False. Almost correct. Using few attributes will reduce the complexity, but adding relationships between these attributes will increase the complexity. There is a simpler model in the list, try to find it!
You’ll gradually take your first steps to correctly perform classification, one of the most important tasks in machine learning today. By the end of this chapter, you’ll be able to learn and build a decision tree and to classify unseen observations with k-Nearest Neighbors.
The Task of Classification is to automatically assign a class to (i.e., classify) observations with certain features.
The classification model assigns a class to new observations based on their features, using previous observations.
A Binary Classification problem is a problem with two possible outcomes (Two Classes). Multiclass (Multiple Class) Classification takes place when there are more than two possible classes.
An example of a dataset for classification is a set of persons. Each person can have three features (attributes like age, weight or income) and a class (e.g., happy or unhappy for a binary classification problem, or happy, satisfied or unhappy for a multiclass problem).
Features in a classification problem can be numerical (represented by a number, e.g., height, weight, etc.) or categorical (i.e., with a finite number of possible values, e.g., travel class, working status, …).
Let us consider an example where we want to classify a patient as “sick” or “healthy” (not sick). An intuitive way to classify is by asking questions. For example, the first question could be: is the patient young or old?
The next question depends on the answer of the previous question.
If the patient is old: Has the patient smoked more than 10 years?
If the patient is young: Is the patient vaccinated against measles?
and so on …
We can represent these questions as a tree:
How is a tree defined? An example of a tree is shown below:
where
The connecting points (B and C) are called Nodes
The connecting branches (between A and B, and between A and C) are called Edges
The start of the tree (A) is called the Root
The ends of the tree (D, E, F, G) are called the Leaves
Nodes are split into Children. For example, nodes D, E, F and G are children of B and C, and grandchildren of A.
The opposite relation is the parent relation.
The questions on the nodes of the tree are queries on the features of the data:
In this tree, the age feature is numeric and the question splits the sample into younger and older than 18 years (age < 18 or age ≥ 18). A categorical feature can be a test on itself. For example, the feature “travel_class” has three possible outcomes: coach, business, first. So, the node “travel_class” will have three children.
For a patient who is 40 years old, vaccinated and has never smoked, the above tree leads us to the “no” branch, because he is older than 18 years. Then, since he is not a smoker, we again take the “no” branch, which leads us to a leaf node, upon which we can make the prediction: the outcome is not sick.
In order to build the decision tree, we need to learn it from a training set. We need to come up with queries (feature tests) for each node in the tree. We start by splitting the training set into two parts (a binary test) at the root node, e.g., age < 18 or age ≥ 18. One child node will contain all instances for which the test is TRUE, and the other child node will contain all instances for which it is FALSE. Each of them is a part of the training set. Each child node then splits the resulting training subsets again, and this process continues until the leaf nodes contain a small portion of the training set.
The goal of learning a decision tree is to end up with leaves that are pure, i.e., leaves that contain observations of one particular class. In practice this is almost never possible due to noise in the data. We use the learned tree to classify new instances: when classifying a new instance we end up in a leaf, and the class we assign to the instance is typically the class of the majority of the training instances in that leaf.
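The majority vote in a leaf can be sketched in one line of R; the labels below are made up:

```r
# Training labels that ended up in one (impure) leaf — hypothetical
leaf_labels <- c("sick", "healthy", "sick", "sick", "healthy")

# New instances reaching this leaf get the most frequent class
majority_class <- names(which.max(table(leaf_labels)))
majority_class   # "sick"
```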
A learning algorithm that builds a decision tree on training data has to iterate, at each node, over a number of possible tests on specific features and choose the one that gives the best split on the classes. In essence it all comes down to two parts:
Make a list of possible feature tests.
Choose the best one
For categorical features we can use feature tests that haven’t been used yet. For numerical features it is more complex: here we propose not only some features but also thresholds to perform the split on.
The next step is to choose the best test to split the training data on. There are several splitting criteria to decide which test is the best to use. One of the most commonly used is Information gain.
Information gain is closely related to entropy (in the context of information theory and not of physics).
Information gain defines how much information we gain about the training instances in the subgroups when we perform a split based on a feature test. If the test leads to nicely divided classes, the information gain is high. If the test leads to scrambled subsets, the information gain is low. The test that provides the highest information gain is chosen.
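A minimal sketch of this computation, with made-up class counts and base-2 entropy:

```r
# Entropy (in bits) of a vector of class counts
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]              # convention: 0 * log2(0) = 0
  -sum(p * log2(p))
}

parent <- c(4, 4)   # hypothetical node: 4 sick, 4 healthy patients
left <- c(4, 0)     # a test that splits them into two pure subgroups
right <- c(0, 4)

# Information gain: parent entropy minus the weighted child entropies
gain <- entropy(parent) -
  (sum(left) / sum(parent)) * entropy(left) -
  (sum(right) / sum(parent)) * entropy(right)
gain   # a perfect split of a balanced node gains one full bit
```

A scrambled split (e.g., each child again half sick, half healthy) would leave the child entropies at 1 bit and the gain at 0.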
The number of nodes in a decision tree influences the chance of overfitting. If we restrict the size of our decision tree, the bias of our model is higher. This will effectively decrease the chance of overfitting. However, pruning a decision tree will not always increase the model’s accuracy. Most decision tree algorithms in R use a pretty good stopping criterion, i.e., a criterion for deciding when to stop splitting nodes.
As a big fan of shipwrecks, you decide to go to your local library and look up data about Titanic passengers. You find a data set of 714 passengers, and store it in the titanic data frame (Source: Kaggle). Each passenger has a set of features - Pclass, Sex and Age - and is labeled as survived (1) or perished (0) in the Survived column.
To test your classification skills, you can build a decision tree that uses a person’s age, gender, and travel class to predict whether or not they survived the Titanic. The titanic data frame has already been divided into training and test sets (named train and test).
The titanic dataset is loaded into the workspace and split into training and test sets as described in the section “Split the Sets”. The process is repeated here for clarity and reproducibility.
# titanic dataset
titanic_datacamp <- read.table("./Data/datacamp_titanic.dat", header=TRUE)
titanic_datacamp$Survived <- factor(titanic_datacamp$Survived, levels = c("1", "0"))
titanic_datacamp$Pclass <- factor(titanic_datacamp$Pclass, levels = c("1", "2", "3"))
titanic_datacamp$Sex <- factor(titanic_datacamp$Sex, levels = c("female", "male"))
titanic <- titanic_datacamp[with(titanic_datacamp, order(Survived)), ]
n <- nrow(titanic)
shuffled <- titanic[sample(n),]
train_indices <- 1:round(0.7 * n)
test_indices <- (round(0.7 * n)+1):n
train <- shuffled[train_indices,]
test <- shuffled[test_indices,]
# train and test sets
str(train)
## 'data.frame': 500 obs. of 4 variables:
## $ Survived: Factor w/ 2 levels "1","0": 2 2 1 1 2 2 2 1 1 1 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 2 2 3 3 3 2 3 2 2 3 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 2 2 2 2 1 1 1 ...
## $ Age : num 46 27 22 45 24 18 42 21 54 27 ...
str(test)
## 'data.frame': 214 obs. of 4 variables:
## $ Survived: Factor w/ 2 levels "1","0": 2 2 1 2 2 1 1 2 1 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 2 1 2 3 1 2 3 2 3 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 2 1 2 2 1 1 2 1 2 ...
## $ Age : num 65 47 30 29 19 18 8 23.5 19 20 ...
In this exercise, you'll use train to build a decision tree. You can use the rpart() function of the rpart package for this. Behind the scenes, it performs the two steps described above: proposing feature tests and choosing the one with the best split.
Finally, a fancy plot can help you interpret the tree. You will need the rattle, rpart.plot, and RColorBrewer packages to display this.
Note: In problems that have a random aspect, the set.seed() function is used to enforce reproducibility. Use it to allow others to reproduce your results.
set.seed(1)
Load the rpart, rattle, rpart.plot and RColorBrewer packages with library(). Don't forget you have to install any package you haven't yet installed!
library(rpart)
library(rpart.plot)
library(rattle)
library(RColorBrewer)
Use rpart() to learn a tree model and assign the result to tree. You should use three arguments:
"Survived ~ .". This represents the function you’re trying to learn. We’re trying to predict the Survived column, given all other columns (writing "." is the same as writing "Pclass + Sex + Age" in this formula).train set."class" to tell rpart this is a classification problem.treetree <- rpart(Survived ~ ., data = train, method = "class")
Plot the tree with fancyRpartPlot(). This function accepts the tree model as an argument.
fancyRpartPlot(tree)
In the previous exercise you made a fancy plot of the tree that you’ve learned on the training set. Have another look at the close-up of the example node below.
Remember how in the introduction we saw that a tree is learned by separating the training set step-by-step? In an ideal world, the separations lead to subsets that all have the same class. In reality, however, each division will contain both positive and negative training observations. In the example node, 75% of the training instances are positive and 25% are negative. The majority class thus is positive, or 1, which is signaled by the number 1 on top. The 39% bit tells you which percentage of the entire training set passes through this particular node. On each tree level, these percentages thus sum up to 100%. Finally, the Pclass = 1,2 bit specifies the feature test on which this node will be separated next. If the test comes out positive, the left branch is taken; if it’s negative, the right branch is taken.
Now that you can interpret the tree, can you tell which of the following statements is correct?
The majority class of the root node is positive, denoting survival.
Incorrect. 59% of the observations in the top node are negative.
The feature test that follows when the Sex is not female, is based on a categorical variable.
Incorrect. The feature test Age is based on a numerical variable.
The tree will predict that female passengers in class 3 will not survive, although it’s close.
Correct. 25% of the females travelled in class 3, and 42% of them will survive (against 58% that won’t).
The leftmost leaf is very impure, as the vast majority of the training instances in this leaf are positive.
Incorrect. On the contrary. In the leftmost leaf 94% of the training instances are positive, which denotes very high purity!
The previous learning step involved proposing different tests on which to split nodes and then selecting the best test using an appropriate splitting criterion. You were spared all the implementation hassle that comes with that: the rpart() function did it all for you.
Now you are going to classify the instances that are in the test set. As before, the data frames titanic, train and test are available in your workspace. You’ll only want to work with the test set, though.
set.seed(1)
library(rpart)
tree <- rpart(Survived ~ ., train, method = "class")
Use tree to predict the labels of the test set with the predict() function; store the resulting prediction in pred.
pred <- predict(tree, test, type = "class")
Build the confusion matrix, conf, of your predictions on the test set. The true labels, test$Survived, should be on the rows.
conf <- table(test$Survived, pred)
sum(diag(conf))/sum(conf)
## [1] 0.7803738
A like-minded shipwreck fanatic is also doing some Titanic predictions. He passes you some code that builds another decision tree model. The resulting model, tree, seems to work well but it’s pretty hard to interpret. Generally speaking, the harder it is to interpret the model, the more likely you’re overfitting on the training data.
You decide to prune his tree. This implies that you’re increasing the bias, as you’re restricting the amount of detail your tree can model. Finally, you’ll plot this pruned tree to compare it to the previous one.
All packages are pre-loaded, as is the data.
A model, tree, is already coded.
set.seed(1)
tree <- rpart(Survived ~ ., train, method = "class", control = rpart.control(cp=0.00001))
Use fancyRpartPlot() to plot it. What do you think?
fancyRpartPlot(tree)
Use the prune() method to shrink tree to a more compact tree, pruned. Also specify the cp argument to be 0.01. This is a complexity parameter; it basically tells the algorithm to remove node splits that do not sufficiently decrease the impurity.
pruned <- prune(tree, cp = 0.01)
fancyRpartPlot(pruned)
Great! Another way to check if you overfit your model is by comparing the accuracy on the training set with the accuracy on the test set. You’d see that the difference between those two is smaller for the simpler tree.
You can also set the cp argument while learning the tree with rpart() using rpart.control.
A 25-year old man was in first class on the Titanic. Using the tree model we built above, it's your task to predict whether he survived the Titanic accident or not.
Remember: 1 corresponds to survived, 0 corresponds to deceased.
The model predicts that the man perished in the accident.
Incorrect. You should start at the root of the tree and use the given features to navigate down to a leaf in the tree.
The model is certain that the man survived the accident.
Incorrect. The model may predict the man will survive, but how certain are you about this prediction? Unless he was your grandfather, it’s best you take a look at the distribution of classes in the leaf you end up in.
The model predicts that the man most probably survived the accident.
Correct! The model will predict that the man survives the disaster. However, if you take a look at the distribution of classes in the leaf, you see that the model does not split the classes perfectly, so there's no way to be absolutely sure.
Classifying an e-mail as spam or not. Do you remember the spam filters we built and tested in chapters 1 and 2? Well, it's time to make the filter more serious! We will use the original UCI repository dataset, where some relevant data for every email are included. This extra information, such as word and character frequencies, will help filter the spam. All of this can be found in the emails dataset read in below. The training set train and the test set test (with a 70/30 split) have also been built from it.
emails <- read.csv("./Data/emails_uci.csv", header=TRUE, sep = ",")
# Shuffle the observations of the emails dataset and store the result in shuffled.
n <- nrow(emails)
shuffled <- emails[sample(n),]
# Split the dataset into a train set, and a test set with a 70/30 split.
train_indices <- 1:round(0.7 * n)
test_indices <- (round(0.7 * n)+1):n
train <- shuffled[train_indices,]
test <- shuffled[test_indices,]
In this exercise, you’ll build two decision trees based on different splitting criteria. We’ve learned that the higher the information gain when you split, the better. However, the standard splitting criterion of rpart() is the Gini impurity.
It is up to you now to compare the information gain criterion with the Gini impurity criterion: how do the accuracy and resulting trees differ?
First, set the random seed.
set.seed(1)
Build tree_g, pred_g, conf_g and acc_g. Here, the tree is trained with the Gini impurity criterion, which rpart() uses by default.
tree_g <- rpart(spam ~ ., train, method = "class")
# more specifically:
# tree_g <- rpart(spam ~ ., train, method = "class", parms = list(split = "gini"))
pred_g <- predict(tree_g, test, type = "class")
conf_g <- table(test$spam, pred_g)
acc_g <- sum(diag(conf_g)) / sum(conf_g)
Change the rpart() call so that it splits using the information gain criterion, which is coded as "information". The code that calculates pred_i, conf_i and acc_i is the same as before.
tree_i <- rpart(spam ~ ., train, method = "class", parms = list(split = "information"))
pred_i <- predict(tree_i, test, type = "class")
conf_i <- table(test$spam, pred_i)
acc_i <- sum(diag(conf_i)) / sum(conf_i)
Plot tree_g and tree_i using fancyRpartPlot().
fancyRpartPlot(tree_g)
fancyRpartPlot(tree_i)
acc_g
## [1] 0.8992754
acc_i
## [1] 0.9086957
As you can see, using a different splitting criterion can influence the resulting model of your learning algorithm. However, the resulting trees are quite similar. The same variables are often present in both trees and the accuracy on the test set is comparable: 89.9% and 90.9%.
With decision trees you use a training set to build a model, which you then use to make predictions about an unseen set of observations.
Another way to make predictions is instance-based learning.
With instance-based learning you basically save your training set in memory; there is no notion of a trained model (as in the decision tree).
The learner then compares unseen instances to all instances in its training set.
It makes predictions based on this comparison between unseen and training data.
k-Nearest Neighbors (k-NN) is such a method. Its simplest form is 1-Nearest Neighbor, or simply NN.
NN compares the features of a new, unseen observation with the features of all instances in the saved training set.
More generally, k-NN finds the k closest training instances (i.e., the nearest neighbors) to the unseen instance and assigns the class that is most represented among those neighbors.
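A minimal base-R sketch of the 1-NN idea. This is a hypothetical helper for illustration; the exercises below use knn() from the class package instead:

```r
# 1-NN: classify one unseen instance by the label of its single closest
# training instance under Euclidean distance.
nn_classify <- function(train_feats, train_labels, new_instance) {
  dists <- apply(train_feats, 1, function(row) sqrt(sum((row - new_instance)^2)))
  train_labels[which.min(dists)]
}

train_feats  <- rbind(c(0, 0), c(1, 1))  # two training instances
train_labels <- c("neg", "pos")
nn_classify(train_feats, train_labels, c(0.9, 0.8))  # "pos"
```

The unseen point (0.9, 0.8) lies much closer to (1, 1) than to (0, 0), so it receives the "pos" label.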
The distance metric (i.e., how you measure the distance between instances) is a crucial aspect of k-NN.
Euclidean distance: d(x, y) = √( Σ_i (x_i − y_i)² )
Manhattan distance: d(x, y) = Σ_i |x_i − y_i|
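Both metrics are one-liners in base R, for example (illustrative vectors):

```r
# Euclidean vs. Manhattan distance between two feature vectors.
x <- c(1, 2, 3)
y <- c(4, 2, 1)
euclidean <- sqrt(sum((x - y)^2))  # sqrt(9 + 0 + 4) = sqrt(13)
manhattan <- sum(abs(x - y))       # 3 + 0 + 2 = 5
```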
Two important issues:
Scaling: a small difference between instances 1 and 2 in one of the variables can lead to a larger distance than that between instances 1 and 3, even though instance 3 differs more from instance 1 in both variables! Instance 3 would then be illogically classified.
The solution is to use a different scale for the variable in question. Use a scale that influences the distances so that they represent realistic differences between the variables of the instances (e.g., human height in meters should change to centimeters).
Even better, we should normalize the features, i.e., rescale them to values between 0 and 1.
Categorical features: How can we incorporate them into a distance metric?
Encode them as TRUE or FALSE (1 or 0).

Let's return to the tragic titanic dataset. This time, you'll classify its observations differently, with k-Nearest Neighbors (k-NN). However, there is one problem you'll have to tackle first: scale.
As you've learned, the scale of your input variables may have a great influence on the outcome of the k-NN algorithm. In your case, Age is on an entirely different scale than Sex and Pclass, hence it's best to rescale first!
For example, to normalize a vector, you could do the following:
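A minimal min-max normalization sketch in base R (illustrative values):

```r
# Min-max normalization: rescale a numeric vector to the [0, 1] range
# by subtracting the minimum and dividing by the range.
x <- c(10, 20, 50, 30)
normalized <- (x - min(x)) / (max(x) - min(x))
normalized  # 0.00 0.25 1.00 0.50
```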
Let's reproduce the train and test sets from the titanic dataset:
set.seed(1)
# Insert the original dataset, determine variables classes, order instances
# titanic <- titanic_datacamp[with(titanic_datacamp, order(Survived)), ]
titanic_kaggle <- read.csv("./Data/kaggle_titanic.csv", header=TRUE)
titanic <- subset(titanic_kaggle, !is.na(Age), select = c(Survived, Pclass, Sex, Age))
titanic$Survived <- factor(titanic$Survived, levels = c("1", "0"))
titanic$Sex <- as.numeric(titanic$Sex)
# Shuffle the observations
n <- nrow(titanic)
shuffled <- titanic[sample(n),]
# Split the dataset into the train and test sets by 70/30
train_indices <- 1:round(0.7 * n)
test_indices <- (round(0.7 * n)+1):n
train <- shuffled[train_indices,]
test <- shuffled[test_indices,]
Follow the instructions below to normalize Age and Pclass for both the training and the test set.
Assign the class labels, Survived, of both train and test to separate vectors: train_labels and test_labels.
train_labels <- train$Survived
test_labels <- test$Survived
Copy the train and test sets to knn_train and knn_test.
knn_train <- train
knn_test <- test
Drop the Survived column from knn_train and knn_test. Tip: dropping a column named column in a data frame named df can be done as follows: df$column <- NULL.
knn_train$Survived <- NULL
knn_test$Survived <- NULL
# Alternative commands:
# knn_train <- knn_train[,-1]
# knn_test <- knn_test[,-1]
Normalize Pclass. Pclass is an ordinal value between 1 and 3. Have a look at the code that normalizes this variable in both the training and the test set. To define the minimum and maximum, only the training set is used; we can't use information on the test set (like the minimums or maximums) to normalize the data.
min_class <- min(knn_train$Pclass)
max_class <- max(knn_train$Pclass)
knn_train$Pclass <- (knn_train$Pclass - min_class) / (max_class - min_class)
knn_test$Pclass <- (knn_test$Pclass - min_class) / (max_class - min_class)
Normalize Age. In a similar fashion, normalize the Age column of knn_train as well as knn_test. Again, you should only use features from the train set to decide on the normalization! You should use the intermediate variables min_age and max_age.
min_age <- min(knn_train$Age)
max_age <- max(knn_train$Age)
knn_train$Age <- (knn_train$Age - min_age) / (max_age - min_age)
knn_test$Age <- (knn_test$Age - min_age) / (max_age - min_age)
As you may have noticed, preprocessing data can be a tedious job. Luckily there are packages available for that. Coding it yourself is always good to gain some intuition though. Note that the variable Sex already took values between 0 and 1, hence did not require normalization.
The knn() function

Now that you have your preprocessed data (available in your workspace as knn_train, knn_test, train_labels and test_labels), you are ready to start actually classifying some instances with k-Nearest Neighbors.
To do this, you can use the knn() function which is available from the class package.
Let’s try it out and see what the result looks like.
# Set random seed.
set.seed(1)
Load the class package. Don't forget to install.packages("class"), if not done so yet!
library(class)
Use knn() to predict the values of the test set based on 5 neighbors. Fill in the variables in the function. The prediction result is assigned to pred. The function takes four arguments:
train: observations in the training set, without the class labels, available in knn_train.
test: observations in the test set, without the class labels, available in knn_test.
cl: factor of true class labels of the training set, available in train_labels.
k: number of nearest neighbors you want to consider, 5 in our case.
pred <- knn(train = knn_train, test = knn_test, cl = train_labels, k = 5)
With test_labels and pred, the predicted labels, use table() to build a confusion matrix: conf. Make test_labels the rows in the confusion matrix. Print out the confusion matrix.
conf <- table(test_labels, pred)
conf
## pred
## test_labels 1 0
## 1 61 24
## 0 17 112
Let’s now take this to the next level by iterating over different values of k and comparing the accuracies.
A big issue with k-Nearest Neighbors is the choice of a suitable k. How many neighbors should you use to decide on the label of a new observation? Let’s have R answer this question for us and assess the performance of k-Nearest Neighbor classification for increasing values of k.
Again, knn_train, knn_test, train_labels and test_labels that you’ve created before are available in your workspace.
# Set random seed.
set.seed(1)
Define range, a vector of k values to try, and an accs vector to store the accuracies for these different values. Don't forget to load the class package.
library(class)
range <- 1:round(0.2 * nrow(knn_train))
accs <- rep(0, length(range))
In the for loop:
Use knn() to predict the values of the test set like you did in the previous exercise. This time set the k argument to k, the loop index of the for loop. Assign the result to pred.
With test_labels and pred, the predicted labels, use table() to build a confusion matrix.
Derive the accuracy and assign it to the correct index in accs. You can use sum() and diag() like you did before.
for (k in range) {
# Make predictions pred using knn:
pred <- knn(train = knn_train, test = knn_test, cl = train_labels, k = k)
# Construct the confusion matrix: conf
conf <- table(test_labels, pred)
# Calculate the accuracy and store it in accs[k]
accs[k] <- sum(diag(conf))/sum(conf)
}
Plot the accuracies against range, labeling the x-axis with the xlab argument.
plot(range, accs, xlab = "k")
Use which.max() and print the best k out to the console. Tip: you want to find out which index of the accs vector holds the highest value.
which.max(accs)
## [1] 73
For an even more robust estimate of the best k, you could use cross validation that you’ve learned about in Chapter 2.
A cool way to visualize how 1-Nearest Neighbor works with two-dimensional features is the Voronoi diagram. It's basically a plot of all the training instances, together with a set of tiles around the points. Each tile represents the region of influence of one training instance. When you want to classify a new observation, it receives the class of the tile in which its coordinates fall. Pretty cool, right?
In the plot below you can see training instances that belong to either the blue or the red class. Each instance has two features: x and y. The top left instance, for example, has an x value of around 0.05 and a y value of 0.9.
Suppose you are given an unseen observation with features . Looking at the Voronoi diagram, which class would you give this observation?
It’s impossible to tell!
Incorrect. It is possible, try to locate a point with coordinates .
Red
Incorrect. Have another look. Try to tell where a point with coordinates would be located and in which colored region it would lie.
Blue
Correct! The new observation is closest to an observation with the class label ‘blue’.
Receiver Operating Characteristic curve
Both decision trees and k-NN are used to predict the class of unknown observations. They can also provide the probability that an instance belongs to a specific class. For example, a decision tree may predict that a patient is sick rather than healthy with a probability of 70% over 30%. In order to decide whether the patient should indeed be classified as sick or healthy, we must define a probability threshold above which we call the patient sick. So if we define that a patient is classified as sick at a probability of 50% or higher, then based on the decision tree's result we classify this patient as sick. However, we might want to lower the threshold to 30%, to classify more patients as sick and make sure we are not sending sick people home. But with this threshold, more truly healthy patients will also be classified as sick! This thresholded probability output is called the decision function, and it is very important for drawing the ROC curve.
A recap of the confusion matrix, another performance measure for classification; it is also important for the ROC curve. Two ratios from the confusion matrix are especially important: the true positive rate (TPR) and the false positive rate (FPR).
They are computed from the confusion matrix as:
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
To draw the curve we need a classifier that outputs probabilities, i.e., a decision function. In our example it is the probability that helps us decide whether to diagnose a patient as sick or healthy. For example, if the probability of the patient being sick is >= 50% (healthy < 50%), we diagnose the patient as sick. We classify the entire test set of patients, build the confusion matrix, and calculate the TPR and FPR. This becomes one point on the ROC curve. However, we could also classify every patient as sick, even if a patient has zero chance of being sick. In this case TPR = FPR = 1, which is another point on the ROC curve. Similarly, we could classify all patients as healthy, in which case TPR = FPR = 0, again a point on the ROC curve. For every threshold between 0% and 100% we can compute the TPR and FPR, producing one point on the ROC curve.
How do we know if our classifier is any good, given the curve? The closer the ROC curve comes to the upper-left corner of the grid, the better it is. This corner corresponds to TPR = 1 and FPR = 0. This means that good classifiers have a large area under their ROC curve. We use this area, which has a maximum of 1, to compare classification models. An Area Under the Curve (AUC) of 0.9 corresponds to a very good classification model.
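To make the threshold sweep concrete, here is a small base-R sketch (made-up labels and probabilities) that computes one (FPR, TPR) point per threshold, which is what the ROCR package does for you behind the scenes:

```r
labels <- c(1, 1, 1, 0, 0)            # true classes of a toy test set
probs  <- c(0.9, 0.7, 0.4, 0.6, 0.2)  # predicted P(class = 1)

# One ROC point: classify at the given threshold, then compute
# TPR = TP / (TP + FN) and FPR = FP / (FP + TN).
roc_point <- function(threshold) {
  pred <- as.numeric(probs >= threshold)
  tpr <- sum(pred == 1 & labels == 1) / sum(labels == 1)
  fpr <- sum(pred == 1 & labels == 0) / sum(labels == 0)
  c(fpr = fpr, tpr = tpr)
}

roc_point(0)    # everyone classified positive: fpr = 1, tpr = 1
roc_point(0.5)  # fpr = 0.5, tpr = 2/3
roc_point(1.1)  # everyone classified negative: fpr = 0, tpr = 0
```

Sweeping the threshold from 0 to 1 traces the full curve from (1, 1) down to (0, 0).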
In this exercise you will work with a medium sized dataset about the income of people given a set of features like education, race, sex, and so on. Each observation is labeled with 1 or 0: 1 means the observation has annual income equal or above $50,000, 0 means the observation has an annual income lower than $50,000 (Source: UCIMLR). This label information is stored in the income variable.
First, we read the prepared train and test datasets. (We also change some classes appropriately).
train <- read.csv("./Data/income_train.csv", header=TRUE, sep = ",")
train$education.num <- as.integer(train$education.num)
train$capital.gain <- as.integer(train$capital.gain)
train$income <- as.integer(train$income)
test <- read.csv("./Data/income_test.csv", header=TRUE, sep = ",")
test$education.num <- as.integer(test$education.num)
test$capital.gain <- as.integer(test$capital.gain)
test$income <- as.integer(test$income)
A tree model, tree, is learned for you on a training set and tries to predict income based on all other variables in the dataset.
# Set random seed.
set.seed(1)
# Build a tree on the training set: tree
library(rpart)
tree <- rpart(income ~ ., train, method = "class")
In previous exercises, you used this tree to make class predictions, by setting the type argument in predict() to "class".
To build an ROC curve, however, you need the probabilities that the observations are positive. In this case, you'll want to predict the probability of each observation in the test set (available as well) having an annual income equal to or above $50,000. For that, you'll have to set the type argument of predict() to "prob".
Predict the probabilities of the test set observations using predict(). It takes three arguments: the tree model, the test set, and the type ("prob"). Assign the result to all_probs.
all_probs <- predict(tree, test, type="prob")
Have a look at all_probs. Ask yourself the question: what kind of data structure is it?
head(all_probs)
## 0 1
## 1 0.9458348 0.05416521
## 2 0.6925665 0.30743347
## 3 0.9458348 0.05416521
## 4 0.9458348 0.05416521
## 5 0.9458348 0.05416521
## 6 0.9458348 0.05416521
Select the second column of all_probs, corresponding to the probabilities of the observations belonging to class 1. Assign it to probs.
probs <- all_probs[,2]
Now that you have the probabilities of every observation in the test set belonging to the positive class (annual income equal or above $50,000), you can build the ROC curve.
You’ll use the ROCR package for this. First, you have to build a prediction object with prediction(). Next, you can use performance() with the appropriate arguments to build the actual ROC data and plot it.
probs, which you had to calculate in the previous exercise, is again coded here.
# Code of previous exercise
set.seed(1)
tree <- rpart(income ~ ., train, method = "class")
probs <- predict(tree, test, type = "prob")[,2]
# install.packages("ROCR")
library(ROCR)
Use prediction() with probs and the true labels of the test set (in the income column of test) to get a prediction object. Assign the result to pred.
pred <- prediction(probs, test$income)
Use performance() on pred to get the ROC curve. The second and third arguments of this function should be "tpr" and "fpr", which stand for true positive rate and false positive rate, respectively. Assign the result to perf.
perf <- performance(pred, "tpr", "fpr")
plot(perf)
A single look at the ROC curve gives you an idea of the classifier’s quality. If the curve comes close to the upper-left corner, it’s pretty good. In the next exercise we actually quantify what “good” means.
The same package you used for constructing the ROC curve can be used to quantify the area under the curve, or AUC. The same tree model is loaded, and the test set’s probabilities have again been calculated.
# Build tree and predict probability values for the test set
set.seed(1)
tree <- rpart(income ~ ., train, method = "class")
probs <- predict(tree, test, type = "prob")[,2]
Again using the ROCR package, you can calculate the AUC. The use of prediction() is identical to before. However, the performance() function needs some tweaking.
library(ROCR)
Use prediction() with the probabilities and true labels of the test set to get a prediction object. Assign it to pred.
pred <- prediction(probs, test$income)
Use performance() with this prediction object to compute the AUC. The second argument of this function should be "auc", which stands for area under curve. Assign it to perf.
perf <- performance(pred, "auc")
Print out the AUC, which is stored in perf@y.values[[1]].
perf@y.values[[1]]
## [1] 0.8463732
Now you get a single number to assess the quality of your classifier. An AUC of around 0.85 is actually pretty good!
In the following plot you can see two ROC curves. The blue one belongs to a first decision tree model, DT1, and the red one belongs to another decision tree model, DT2.
These curves are formed on the test set used in the previous exercises, i.e. a test set of the income dataset.
Which of the classifiers would have the highest AUC, according to this plot?
DT2 covers a smaller area under its curve.

In this exercise you're going to assess two models: a decision tree model and a k-Nearest Neighbor model. You will compare the ROC curves of these models to draw your conclusions.
In this example we have some predictions from two spam filters. These spam filters calculated the probabilities of unseen observations in the test set being spam. The real spam labels of the test set can be found in test$spam.
To build the first spam filter we will use the original UCI repository dataset, where some relevant data for every email are included. This extra information, such as word and character frequencies, will help filter the spam. All of this can be found in the emails dataset read in below. The training set train and the test set test (with a 70/30 split) have also been built from it.
Let’s produce the datasets.
# Read the original emails dataset from UCI
emails <- read.csv("./Data/emails_uci.csv", header=TRUE, sep = ",")
# Shuffle the observations of the emails dataset and store the result in shuffled.
n <- nrow(emails)
shuffled <- emails[sample(n),]
# Split the dataset into a train set, and a test set with a 70/30 split.
train_indices <- 1:round(0.7 * n)
test_indices <- (round(0.7 * n)+1):n
train <- shuffled[train_indices,]
test <- shuffled[test_indices,]
A given test set will be now loaded into your workspace as test, for which probabilities are already assigned.
test <- read.csv("./Data/email_spam_test.csv", header=TRUE, sep = ",")
It is your job to use your knowledge about the ROCR package to plot two ROC curves, one for each classifier. The assigned probabilities for the observations in the test set are being loaded into the workspace below. That for the decision tree model is called probs_t, and for k-Nearest Neighbors is called probs_k.
probs_k <- read.table("./Data/spam_probs_k-nn.tab", header = FALSE)
probs_t <- read.table("./Data/spam_probs_tree.tab", header = FALSE)
library(ROCR)
probs_t and probs_k are the probabilities of being spam, predicted by the two classifiers. Use prediction() to create prediction objects for probs_t and probs_k. Call them pred_t and pred_k.
pred_t <- prediction(probs_t, test$spam)
pred_k <- prediction(probs_k, test$spam)
Use pred_t and pred_k to create performance objects with the performance() function. The second and third arguments should be "tpr" and "fpr" for both calls. Call them perf_t and perf_k.
perf_t <- performance(pred_t,"tpr","fpr")
perf_k <- performance(pred_k,"tpr","fpr")
The function draw_roc_lines() draws both ROC lines. It takes two arguments: the first is the performance object of the tree model, perf_t, and the second is the performance object of the k-Nearest Neighbor model, perf_k. The function syntax is shown below (by the way, the source code of an object in R can be acquired with getAnywhere()):
draw_roc_lines <- function(tree, knn) {
if (!(class(tree)== "performance" && class(knn) == "performance") ||
!(attr(class(tree),"package") == "ROCR" && attr(class(knn),"package") == "ROCR")) {
stop("This predefined function needs two performance objects as arguments.")
} else if (length(tree@x.values) == 0 | length(knn@x.values) == 0) {
stop('This predefined function needs the right kind of performance objects as
arguments. Are you sure you are creating both objects with arguments "tpr"
and "fpr"?')
} else {
plot(0,0,
type = "n",
main = "ROC Curves",
ylab = "True positive rate",
xlab = "False positive rate",
ylim = c(0,1),
xlim = c(0,1))
lines(tree@x.values[[1]], tree@y.values[[1]], type = "l", lwd = 2, col = "red")
lines(knn@x.values[[1]], knn@y.values[[1]], type = "l", lwd = 2, col = "green")
legend("bottomright", c("DT","KNN"), lty=c(1,1),lwd=c(2.5,2.5),col=c("red","green"))
}}
Let’s use it to plot the curves:
draw_roc_lines(perf_t, perf_k)
Can you interpret these ROC curves? Which classifier works best?
Although a traditional subject in classical statistics, you can also consider regression from a machine learning point of view. You’ll learn more about the predictive capabilities and performance of regression algorithms. At the end of this chapter you’ll be acquainted with simple linear regression, multi-linear regression and k-Nearest Neighbors regression.
Regression is very similar to classification. Instead of trying to predict one class for the observation, with regression we try to estimate an actual value.
Example: Data about a Shop: sales, competitors, district, ….
Rely on a model of this relation.
Simple: There is only one predictor to model the response.
Linear: There is approximately a linear relationship between predictor and response.
The linear relationship is of the form: y ≈ β0 + β1 * x.
There is no line that perfectly fits all data, therefore we do have a statistical error, which idealy will be small.
The goal is to find the line that fits the predictor and response data the best.
The goodness of a fit is measured by the residuals:
The goodness of fit is also measured by the residuals sum of squares (RSS):
The goal of regression is to find the coefficients, and , that minimize RSS. The corresponding line is called the regression line.
In R we estimate the coefficients with the lm() function:
my_lm <- lm(sales ~ ads, data = shop_data)
In its formula the response (sales) comes first and the predictor (ads) second. The estimated coefficients are returned in the model component coefficients.
my_lm$coefficients
Having a working model allows you to predict new outcomes as:
$$\hat{y}_{new} = \hat{\beta}_0 + \hat{\beta}_1 x_{new}$$
In R this is done with the function predict():
y_new <- predict(my_lm, x_new, interval = "confidence")
The predict() function comes with several features, such as returning confidence intervals (as in the example above), the standard error of the prediction, and more.
RMSE: Root Mean Square Error, yet another accuracy measure, this one referring to regression.
From the residual sum of squares we compute the RMSE:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} res_i^2}$$
RMSE is very useful for comparing the results of different regressions: the lower the RMSE, the better your fit. It is basically the standard deviation of the error. However, RMSE depends on the unit and scale of the response variable, which makes its value difficult to interpret on its own.
The difficulty of interpreting RMSE is solved by comparing the residual sum of squares $SS_{res}$ to the total sum of squares $SS_{tot}$. This comparison is included in the $R^2$ measure:
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
R calculates this value with the summary() function:
summary(my_lm)$r.squared
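As a quick illustration of the formulas above, here is a minimal sketch on simulated data (a hypothetical toy set, not the course's shop_data) that computes the RMSE and $R^2$ by hand and checks $R^2$ against summary():

```r
# Simulated toy data: a linear signal plus noise
set.seed(2)
x <- runif(50, 0, 10)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

res    <- y - predict(fit)            # residuals: observed minus fitted
rmse   <- sqrt(mean(res^2))           # root mean square error
ss_res <- sum(res^2)                  # residual sum of squares
ss_tot <- sum((y - mean(y))^2)        # total sum of squares
r_sq   <- 1 - ss_res / ss_tot         # R-squared by hand

rmse
r_sq
all.equal(r_sq, summary(fit)$r.squared)  # TRUE
```

The hand-computed $R^2$ agrees with the one R reports, which is exactly the $1 - SS_{res}/SS_{tot}$ comparison described above.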
In your first exercise, you’ll familiarize yourself with the concept of simple linear regression. You are given measures of grey kangaroos’ nose width and length. You can find the data in kang_nose, which is loaded in your workspace in the typical manner:
kang_nose <- read.csv("./Data/kang_nose.csv", header=TRUE, sep = ",")
The dataset has two columns: nose_width and nose_length.
Your job is to describe the linear relationship between the grey kangaroo’s nose width (mm) and nose length (mm). Make use of the lm() function and consult the help file if necessary. Remember to explore your data first!
We caught Skippy and measured its nose width, nose_width_new, but it escaped before we measured its nose length. Can you help?
Use the plot() function to draw a scatter plot and have a look. Is linearity plausible?
plot(kang_nose$nose_width, kang_nose$nose_length, xlab = "nose width", ylab = "nose length")
# or simply
# plot(kang_nose, xlab = "nose width", ylab = "nose length")
Use lm() to build a linear model, lm_kang, which models nose_length by the nose_width variable.
lm_kang <- lm(nose_length ~ nose_width, data = kang_nose)
Print the coefficients component of lm_kang.
lm_kang$coefficients
## (Intercept) nose_width
## 27.893058 2.701175
Use predict() with the lm_kang model as a first argument to predict the nose length corresponding with the nose width in nose_width_new. You can use nose_width_new as the second argument of predict().
# First insert the predictor value in a data frame
# (Important: must be a data frame with the variable column with the same label as that
# used in the model)
nose_width_new <- data.frame(nose_width = 250)
predict(lm_kang, nose_width_new)
## 1
## 703.1869
There is a visible linear pattern between the nose width and length of Grey Kangaroos! With a nose width of 250 mm, your prediction tells us Skippy has a nose length of around 700 mm. Thanks! We can stop chasing Skippy!
Now that you’ve got a grasp on the concept of simple linear regression, let’s move on to assessing the performance. Let’s stick to the Kangaroo example. The dataset, kang_nose, as well as the linear model you built, lm_kang, are still available.
In this exercise, you'll plot the regression line through the data points. Next, you'll calculate the residuals, $res_i = y_i - \hat{y}_i$. Lastly, you'll determine the Root Mean Square Error:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} res_i^2}$$
This estimates the standard deviation of the model's error. Remember, the smaller the RMSE, the better your fit!
Plot the kang_nose data with a regression line.
lm_kang <- lm(nose_length ~ nose_width, data = kang_nose)
plot(kang_nose, xlab = "nose width", ylab = "nose length")
abline(lm_kang$coefficients, col = "red")
Use predict() on lm_kang and assign the result to nose_length_est. These values correspond to your estimates $\hat{y}_i$.
nose_length_est <- predict(lm_kang)
Determine the residuals by subtracting the predicted values from the actual values. Actual values are stored in kang_nose$nose_length and predictions in nose_length_est. Assign them to res.
res <- kang_nose$nose_length - nose_length_est
Calculate rmse and print it to the console.
rmse <- sqrt(sum(res^2, na.rm = T) / length(res))
rmse
## [1] 43.26288
The regression line passes well through the datapoints. What do you think of the RMSE value: 43mm? High or low? It’s difficult to interpret it if you have no other model to compare with. In the next exercise you’ll address this by calculating the R-squared.
You've correctly calculated the RMSE in the last exercise, but were you able to interpret it? You can compare the RMSE to the total variance of your response by calculating the $R^2$, which is unitless. The closer $R^2$ is to 1, the greater the degree of linear association between the predictor and the response variable.
R calculates these performance measures for you. You can display them by applying summary() to your linear model object. Your job is now to manually calculate $R^2$ and compare your value to the value that R calculates automatically.
Here $SS_{res}$ is the residual sum of squares,
$$SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$
whereas $SS_{tot}$ is the total sum of squares,
$$SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2,$$
with $\bar{y}$ the sample mean of the response. res, kang_nose and lm_kang are still available in your workspace.
Calculate ss_res.
# Remember: nose_length_est <- predict(lm_kang)
# res <- kang_nose$nose_length - nose_length_est
ss_res <- sum(res^2, na.rm = T)
Take the difference between the individual nose_lengths and the average, square those values, and then sum() them. Assign the result to ss_tot.
# Individual responses: kang_nose$nose_length
# Mean of responses: mean(kang_nose$nose_length)
ss_tot <- sum((kang_nose$nose_length - mean(kang_nose$nose_length))^2)
Calculate 1 - ss_res/ss_tot. Assign your outcome to r_sq and print it.
r_sq <- 1 - (ss_res / ss_tot)
r_sq
## [1] 0.7768914
Use summary() to generate a summary of lm_kang. Take a look at the outcome Multiple R-squared.
summary(lm_kang)
##
## Call:
## lm(formula = nose_length ~ nose_width, data = kang_nose)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.876 -32.912 -4.855 30.227 86.307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.8931 54.2991 0.514 0.61
## nose_width 2.7012 0.2207 12.236 1.34e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.26 on 43 degrees of freedom
## Multiple R-squared: 0.7769, Adjusted R-squared: 0.7717
## F-statistic: 149.7 on 1 and 43 DF, p-value: 1.342e-15
Apart from rounding, the Multiple R-squared is the same as the r_sq you calculated. An R-squared of 0.77 is pretty good!
You are given data on GDP per capita and its relation to the percentage of urban population for several UN countries, measured in the year 2014 (Source: The World Bank). This dataset is stored in a data frame world_bank_train and has two variables: cgdp and urb_pop. Let’s input the dataset.
world_bank_train <- read.csv(file="./Data/world_bank_train.csv", header = TRUE, sep = ",",
colClasses = c("numeric", "numeric"))
Have a look at the data, do you think a relationship between the two is plausible? Try to set up a linear model for the percentage of urban population based on the GDP per capita.
Afghanistan has a GDP per capita of 413 USD, stored in cgdp_afg, but its urban population in 2014 is not known yet. Can you predict the outcome using your model?
# Inserting the data for Afghanistan
cgdp_afg <- data.frame(413) ### or cgdp_afg <- as.data.frame(413)
colnames(cgdp_afg) <- c("cgdp")
Build a linear model, lm_wb, which models urb_pop by the cgdp variable.
lm_wb <- lm(urb_pop ~ cgdp, data = world_bank_train)
Make a scatter plot, using plot(). Is a relation plausible? Is linearity plausible?
Add the regression line to your plot using abline(). The first argument should be the coefficients of lm_wb. Set col = "red" to have a red line.
plot(world_bank_train, xlab="GDP per Capita", ylab="Urban population")
abline(lm_wb$coefficients, col = "red")
Calculate the R-squared by applying summary() to lm_wb and selecting r.squared from it.
summary(lm_wb)$r.squared
## [1] 0.3822067
Predict the urban population with lm_wb, based on the cgdp_afg input.
predict(lm_wb, cgdp_afg)
## 1
## 45.01204
You must admit that your linear model is barely acceptable and far from satisfying. The R-squared is quite low and the regression line doesn't seem to fit well, so is the predicted 45% to be trusted? Maybe you can improve it in the next exercise.
In the last exercise, your scatter plot didn't show a strong linear relationship. You confirmed this with the regression line and the low $R^2$.
To improve your model, take a step back and study the nature of the data. The predictor variable is numerical, while the response variable is expressed in percentages. It would make more sense if there were a linear relationship between relative changes of the GDP/capita and changes in the response.
To obtain an estimation of relative changes, take the natural logarithm of the GDP/capita and use this as your new predictor variable. A model solution to the previous exercise is included in the editor; up to you to make some changes.
Change the cgdp variable of world_bank_train to log(cgdp) in both plot() and lm(). Remember that R is case sensitive!
lm_wb_log <- lm(urb_pop ~ log(cgdp), data = world_bank_train)
summary(lm_wb_log)$r.squared
## [1] 0.5787588
No need to transform cgdp_afg with log() here; R does this for you.
predict(lm_wb_log, cgdp_afg)
## 1
## 25.86759
Set the xlab argument of plot() to "log(GDP per Capita)".
plot(urb_pop ~ log(cgdp), data = world_bank_train,
xlab = "log(GDP per Capita)",
ylab = "Percentage of urban population")
abline(lm_wb_log$coefficients, col="red")
Your scatter plot now shows a stronger linear relationship! This model predicts the urban population of Afghanistan to be around 26%, what a huge difference with the 45% you had before! In the next exercise you will compare the R-squared measures for both these linear regression attempts.
In your last exercise, you transformed your predictor wonderfully. Your prediction differed substantially between your two models, so which one should we take? To answer this, you can have a look at the $R^2$ values of both attempts. The first linear model you built, which did not use a transformation, is still available as lm_wb. The second model, where log() was used, is available as lm_wb_log.
summary(lm_wb)$r.squared
## [1] 0.3822067
summary(lm_wb_log)$r.squared
## [1] 0.5787588
Which of the following statements is true?
lm_wb returns the best prediction, its explained variance is the lowest.
Incorrect. A low explained variance indicates that the degree of linear association isn’t that strong. A linear regression might not be the best option in this case.
lm_wb_log returns the best prediction, its explained variance is the highest.
Correct.
lm_wb returns the best prediction, its explained variance is the highest.
Incorrect. R-squared is lower for lm_wb, hence the variance explained by it is lower than by lm_wb_log.
lm_wb_log returns the best prediction, its explained variance is the lowest.
Incorrect. R-squared is higher for lm_wb_log, hence the variance explained by it is higher than by lm_wb.
The second model is clearly better at predicting the percentage of urban population based on the GDP per capita. The second model is of a type called log-linear, as you take the logarithm of your predictor variable but leave your response variable unchanged. Overall your model still has quite a lot of unexplained variance. If you want more precise predictions, you'll have to add other relevant variables in a multivariable linear model. Check out the next section to learn more.
Recall the simple linear model relating sales and ads:
my_lm <- lm(sales ~ ads, data = shop_data)
Now introduce another variable that affects sales: nearby competition from other shops. An increase in competition will force sales to drop. There is a decreasing relationship between sales and comp, which can be described with a second linear regression model. Which model is best? Choosing one of them would be a waste of information. Instead we should include both in a single model: multivariable linear regression. In this model we have two predictors instead of one.
The relationship between predictors and response is again linear:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$
As in simple regression, we choose the coefficients that provide the best linear relationship between the predictors and the response. You estimate these coefficients by minimizing the residual sum of squares, just like we did before. The residuals and the corresponding sum of squares are now:
$$res_i = y_i - \hat{y}_i \;=\; y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i}), \qquad RSS = \sum_{i=1}^{n} res_i^2$$
This methodology can be extended to include more predictors.
my_lm <- lm(sales ~ ads + com + ..., data = shop_data)
To assess the closeness of the fit you can again use $R^2$. However, more predictors add more complexity and cost to the model, and statistically speaking they will return a higher $R^2$ value (lower RMSE) regardless of their relationship to the response.
Therefore, we correct the $R^2$ for the added complexity. This new measure is called the Adjusted $R^2$, and it is available in R from the summary() of your multivariable regression model.
summary(my_lm)$adj.r.squared
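To see why the correction matters, here is a minimal sketch on simulated data (hypothetical shop-like variables, not the course's shop_data): adding a pure-noise predictor never lowers $R^2$, but the adjusted $R^2$ penalizes the useless complexity.

```r
set.seed(1)
n     <- 30
ads   <- runif(n, 0, 10)
sales <- 5 + 2 * ads + rnorm(n, sd = 2)
noise <- rnorm(n)                 # predictor with no relation to sales

fit1 <- lm(sales ~ ads)
fit2 <- lm(sales ~ ads + noise)

summary(fit1)$r.squared           # baseline R-squared
summary(fit2)$r.squared           # never lower than the baseline
summary(fit1)$adj.r.squared
summary(fit2)$adj.r.squared       # corrected for the extra predictor
```

For nested models, $R^2$ can only increase when a predictor is added; the adjusted version is the one to compare across models of different sizes.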
The influence of each predictor is evaluated through so-called p-values, available in the summary() of your model. The p-value returned for each predictor is a good indicator of that predictor's importance.
The lower the p-value, the stronger the evidence that the corresponding predictor is significant.
For example, in order to have 95% confidence that a predictor has a significant influence on the response, its p-value should be below 0.05. If you want to be 99% confident, the p-value should be below 0.01.
Which level you choose depends on the application. Typically, 0.05 is a good threshold for most cases. If a p-value is much higher than 0.05, you can say that the influence of the corresponding predictor is very small.
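As a minimal sketch (simulated data, not the course's shop_data), the p-values sit in the coefficient matrix of summary(), in the column "Pr(>|t|)":

```r
# Simulated data: x1 truly influences y, x2 does not
set.seed(42)
x1  <- rnorm(50)
x2  <- rnorm(50)
y   <- 1 + 3 * x1 + rnorm(50)
fit <- lm(y ~ x1 + x2)

# Extract the p-value column from the coefficient matrix
p_values <- summary(fit)$coefficients[, "Pr(>|t|)"]
p_values
p_values < 0.05    # which predictors are significant at the 95% level?
```

Here x1 should come out highly significant while the noise predictor x2 usually does not, matching the threshold logic above.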
Important: do not mix up the interpretation of the $R^2$ with the influence of the predictors.
Is building the linear model, getting the summary, and looking at the p-values enough? It's not that simple. The p-values rest on assumptions about the error term: the residuals should show no pattern when plotted against the fitted values, and they should be approximately normally distributed, which you can check with a normal Q-Q plot, e.g. qqnorm(my_lm$residuals). It is very important to verify these assumptions.
Alternative tests do exist, but these graphs provide a decent, intuitive starting point.
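A minimal sketch of the two diagnostic graphs, on simulated (hypothetical) data:

```r
# Simulated data that satisfies the linear-model assumptions
set.seed(7)
x   <- runif(40, 0, 10)
y   <- 2 + 0.5 * x + rnorm(40)
fit <- lm(y ~ x)

par(mfrow = c(1, 2))
# 1. Residuals vs fitted values: look for the absence of any pattern
plot(fit$fitted.values, fit$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# 2. Normal Q-Q plot: points close to a straight line suggest normal errors
qqnorm(fit$residuals, ylab = "Residual Quantiles")
qqline(fit$residuals)
```

If either plot looks suspicious (a funnel or curve in the first, strong deviation from the line in the second), the reported p-values should not be taken at face value.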
A multi-linear model for net sales of shops can be constructed based on advertisement and competition. The related dataset is available in your workspace as shop_data.
shop_data <- read.table(file="./Data/shop_data.dat", sep = "", header = T)
In this exercise you’ll add even more predictors: inventory (inv), the size of the district (size_dist) and the shop size (sq_ft).
Your job is to set up this model, verify if the fit is good and finally measure the accuracy. Make sure you interpret the results at every step!
Make scatter plots with the plot() function, showing sales as a function of sq_ft, size_dist and inv. Have a look at all three plots that result; is linearity plausible?
par(mfrow=c(1,3))
plot(sales ~ sq_ft, data = shop_data, xlab = "Shop size", ylab = "Sales")
plot(sales ~ size_dist, data = shop_data, xlab = "Size of district", ylab = "Sales")
plot(sales ~ inv, data = shop_data, xlab = "Inventory", ylab = "Sales")
Build a multi-linear model, lm_shop. Use the formula sales ~ . to include all variables.
lm_shop <- lm(sales ~ ., data = shop_data)
Print the summary() of lm_shop. Have a look at the $R^2$ and adjusted $R^2$ and try to interpret them. Is $R^2$ close to 1? Is the adjusted $R^2$ higher than in the example in the previous section (0.906)?
summary(lm_shop)$r.squared
## [1] 0.9931795
summary(lm_shop)$adj.r.squared
## [1] 0.9915556
That’s an amazing result for both R-squared and the adjusted version. Your all-in model seems to outclass the one in the previous section. Let’s interpret the p-values in the next exercise.
To further analyze the performance, take a look at the p-values of every predictor. Are they all relevant? Remember that you should verify the assumptions on the error variable before interpreting these results.
There is a shop owner that didn’t participate in the questionnaire, who has caught wind of your amazing model! He asked us to estimate his net sales based on the data he provided. Can you help him out? The data can be found in shop_new loaded below. shop_data and lm_shop are still available in your workspace.
shop_new <- data.frame(2.3, 420, 8.7, 9.1, 10)
colnames(shop_new) <- c("sq_ft", "inv", "ads", "size_dist", "comp")
Plot the residuals against the fitted values using plot(). You can find the data you need in the residuals and fitted.values components of lm_shop, respectively. Use the comma format by specifying the x and y arguments to plot() explicitly, not the formula format using ~.
plot(lm_shop$fitted.values, lm_shop$residuals, xlab = "Fitted Values", ylab = "Residuals")
Draw a normal Q-Q plot of the residuals using qqnorm(). Set ylab to "Residual Quantiles".
qqnorm(lm_shop$residuals, ylab = "Residual Quantiles")
Print the summary() of lm_shop.
summary(lm_shop)
##
## Call:
## lm(formula = sales ~ ., data = shop_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.338 -9.699 -4.496 4.040 41.139
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.85941 30.15023 -0.626 0.538372
## sq_ft 16.20157 3.54444 4.571 0.000166 ***
## inv 0.17464 0.05761 3.032 0.006347 **
## ads 11.52627 2.53210 4.552 0.000174 ***
## size_dist 13.58031 1.77046 7.671 1.61e-07 ***
## comp -5.31097 1.70543 -3.114 0.005249 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.65 on 21 degrees of freedom
## Multiple R-squared: 0.9932, Adjusted R-squared: 0.9916
## F-statistic: 611.6 on 5 and 21 DF, p-value: < 2.2e-16
predict(lm_shop, shop_new)
## 1
## 262.5006
Indeed. There is no clear pattern in your residuals. Moreover, the residual quantiles are approximately on one line. From the small p-values you can conclude that every predictor is important!
Let's take a different dataset. In 2002 Brisbane did a study on different chocolate bars and measured their energy per 100 grams, percentage of proteins, percentage of fat and the total size. You're wondering whether energy is related to the other variables. You can find the results in choco_data.
choco_data <- read.table(file="./Data/choco_data.dat", sep = "", header = T)
Your job is to build a multi-linear model for the energy based on all other variables and judge its performance.
par(mfrow=c(1,3))
plot(energy ~ protein, data = choco_data, xlab = "Percentage of proteins", ylab = "Energy")
plot(energy ~ fat, data = choco_data, xlab = "Percentage of fat", ylab = "Energy")
plot(energy ~ size, data = choco_data, xlab = "Chocolate size", ylab = "Energy")
Build a multi-linear model, lm_choco, predicting energy from the other variables.
lm_choco <- lm(energy ~ ., data = choco_data)
Plot the residuals against the fitted values using plot().
plot(lm_choco$fitted.values, lm_choco$residuals, xlab = "Fitted Values", ylab = "Residuals")
Draw a normal Q-Q plot of the residuals using qqnorm().
qqnorm(lm_choco$residuals, ylab = "Residual Quantiles")
summary(lm_choco)
##
## Call:
## lm(formula = energy ~ ., data = choco_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.084 -35.756 -8.323 36.100 104.660
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1338.3333 40.0928 33.381 < 2e-16 ***
## protein 23.0019 3.6636 6.279 6.97e-08 ***
## fat 24.4662 1.6885 14.490 < 2e-16 ***
## size -0.8183 0.6035 -1.356 0.181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.2 on 52 degrees of freedom
## Multiple R-squared: 0.9019, Adjusted R-squared: 0.8962
## F-statistic: 159.3 on 3 and 52 DF, p-value: < 2.2e-16
Apparently size is not a good predictor. A good look at the plots and the summary is necessary for answering the questions that follow.
In the last exercise you built a linear regression model for the energy (per 100 g) in a chocolate bar based on its size, percentage of protein and percentage of fat. Are all these predictors relevant with a 95% confidence? The residual and quantile plots from the previous exercise can tell you. Also the summary() of lm_choco provides this information.
Which of the following statements is true?
The predictors protein and fat are statistically insignificant; it is best to remove them.
Incorrect. A p-value below 0.05 indicates that the corresponding predictor is statistically significant with 95% confidence.
The predictor size is statistically insignificant; it is best to remove it.
Correct.
The residuals seem to be dependent, the p-values must be ignored.
Incorrect. If there is no visible pattern in your residual plot, the residuals are assumed to be independent.
The residuals don’t appear to be normally distributed, the p-values must be ignored.
Incorrect. If the quantiles are close to a single line, residuals are assumed to be normally distributed.
There are cases where the data show a clear pattern that is clearly not linear.
One solution is transformation: we transform the variables like we did in the case of the World Bank GDP and urban population data (where we used the logarithm of GDP). It is then sometimes possible to apply multivariable linear regression to the transformed variables.
Another solution is non-parametric regression, which, unlike the regressions we have seen thus far, does not require any parameter estimation at all.
The underlying idea of the k-NN technique is the same as the one we saw in classification: only the k nearest observations determine the outcome.
Let us assume we are given a training set and a new observation of the predictor.
The first step is to calculate the distances in the predictors between the new observation and the training set.
In the second step, select the k observations of the training set with the smallest distance to the new observation.
Finally, aggregate the response of these nearest observations and the outcome is your prediction.
The k-NN, unlike linear regression, does not make any assumption on the data. This is practical when you can’t describe the relationship between your predictors and response, but still you want reliable predictions.
However, if you model a truly linear relationship, your linear regression model will be far more predictive than your k-NN model.
For $k = 1$ we will get a perfect fit to our dataset but very poor predictions.
If we choose $k = n$, the number of observations in the training set, our model simply predicts the mean of the training responses, again rendering it useless.
This issue is known as the bias-variance trade-off, discussed in Chapter 2.
A reasonable starting value of $k$ is 20% of the number of observations in the training set. However, we should experiment with this.
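The two extremes can be sketched with a toy one-dimensional k-NN regressor (knn_reg is a hypothetical helper written for this illustration, the same idea as the my_knn function used later in this chapter; the data are simulated):

```r
# Toy k-NN regressor: predict the mean response of the k nearest neighbors
knn_reg <- function(x_new, x, y, k) {
  sapply(x_new, function(p) mean(y[order(abs(p - x))[1:k]]))
}

set.seed(3)
x <- sort(runif(20, 0, 10))
y <- sin(x) + rnorm(20, sd = 0.2)

# k = 1: every training point predicts itself, a "perfect" (overfit) fit
all(knn_reg(x, x, y, k = 1) == y)

# k = n: every prediction collapses to the overall mean, ignoring x entirely
p_all <- knn_reg(x, x, y, k = length(x))
isTRUE(all.equal(p_all, rep(mean(y), length(x))))

# A common starting point: k around 20% of the training size
k_start <- round(0.2 * length(x))
```

Between these extremes, increasing k trades variance (k too small) for bias (k too large), which is exactly the trade-off named above.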
So far we’ve seen how to build our own regression model. We worked on a training set, using it to fine-tune our model by adding or removing predictors.
However, the question that remains is: Does it generalize well? This is an important ML question, where the goal is to have an accurate prediction.
Hold Out Method for Regression
We split our data in a training set and a test set at random.
We build the model using the training set, and we fine-tune it by adding or removing predictors.
When measures such as the $R^2$ and RMSE (calculated within the training set) are satisfactory, we apply our trained model to the test set.
We compare the estimated responses to the true responses of our test set to obtain the residuals.
We then use these to calculate the RMSE for the test set.
Finally we compare the RMSE of the test set to that obtained for the train set.
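The steps above can be sketched end-to-end on simulated data (a hypothetical toy set, not the World Bank data used below):

```r
# Simulated data with a known linear relationship
set.seed(11)
n <- 100
d <- data.frame(x = runif(n, 0, 10))
d$y <- 3 + 2 * d$x + rnorm(n, sd = 1.5)

# 1. Random split: roughly 70% training, 30% test
train_idx <- sample(n, size = round(0.7 * n))
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# 2. Build the model on the training set and get the training RMSE
fit <- lm(y ~ x, data = train)
rmse_train <- sqrt(mean(fit$residuals^2))

# 3. Predict on the test set and compute the test residuals and RMSE
res_test  <- predict(fit, test) - test$y
rmse_test <- sqrt(mean(res_test^2))

# 4. Compare: a ratio close to 1 suggests the model generalizes well
rmse_test / rmse_train
```

A test RMSE much larger than the training RMSE is the classic signature of overfitting; comparable values indicate good generalization.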
A visual difference between the cases of under- and over-fitting is shown in this example, in comparison to an actual successful fit. Success in fitting, generalizing, and predicting for each case is indicated.
Remember the dataset on GDP per capita and percentage of urban development? You were advised to take the logarithm of GDP per capita. Both the regression line and the $R^2$ showed a better result for the log-linear model, lm_wb_log, than for the simple linear model, lm_wb. Don't forget that in this model you want to predict the percentage of urban population (response), given the GDP per capita (predictor)!
You might be wondering whether you were misguided and produced an overfit. Your workspace is loaded with the world_bank_train set that you've used before to build the log-linear model. Now, however, world_bank_test also becomes available. You can use this dataset to test whether your model generalizes well.
world_bank_train <- read.csv(file="./Data/world_bank_train.csv", header = TRUE, sep = ",",
colClasses = c("numeric", "numeric"))
str(world_bank_train)
## 'data.frame': 142 obs. of 2 variables:
## $ cgdp : num 666 5936 4619 7574 3647 ...
## $ urb_pop: num 26.3 43.3 56.4 57.6 62.8 ...
world_bank_test <- read.csv(file="./Data/world_bank_test.csv", header = TRUE, sep = ",",
colClasses = c("numeric", "numeric"))
str(world_bank_test)
## 'data.frame': 61 obs. of 2 variables:
## $ cgdp : num 18389 1099 2379 5823 3670 ...
## $ urb_pop: num 39.8 26.7 37.9 46.2 53.5 ...
You’ll be comparing the RMSE for both sets:
Calculate rmse_train. You'll need it later to calculate the ratio between the training and test RMSE.
# Build the log-linear model
lm_wb_log <- lm(urb_pop ~ log(cgdp), data=world_bank_train)
# Calculate the RMSE
rmse_train <- sqrt(mean(lm_wb_log$residuals^2))
Calculate the test residuals, res_test. You first have to find the ground-truth labels and then compare them with the predictions you made with lm_wb_log.
# The real percentage of urban population in the test set, the ground truth
world_bank_test_truth <- world_bank_test$urb_pop
# The predictions of the percentage of urban population in the test set
# Get the predictors to be used inside the predict() function.
# Note that we input them as data.frame, not vector.
# Note also the naming of the field with the same label as the field to be compared to.
world_bank_test_input <- data.frame(cgdp = world_bank_test$cgdp)
# Predict the urban population from your lm_wb_log model.
# Note that we do not need to get the logarithm of GDP
world_bank_test_output <- predict(lm_wb_log, world_bank_test_input)
# The residuals: the difference between the predictions and the ground truth
res_test <- world_bank_test_output - world_bank_test_truth
Calculate rmse_test.
rmse_test <- sqrt(mean(res_test^2))
rmse_train
## [1] 13.86704
rmse_test
## [1] 15.01911
rmse_test / rmse_train
## [1] 1.08308
The test’s RMSE is only slightly larger than the training RMSE. This means that your model generalizes well to unseen observations. You can conclude the logarithm transformation did improve your model. It fits your data better and does a good job at generalizing!
Here we inspect up close how a k-NN algorithm works.
A function my_knn that contains a k-NN algorithm is already coded. Its arguments are:
x_pred: predictor values of the new observations (this will be the cgdp column of world_bank_test)
x: predictor values of the training set (the cgdp column of world_bank_train)
y: corresponding response values of the training set (the urb_pop column of world_bank_train)
k: the number of neighbors (this will be 30)
The function returns the predicted values for your new observations (predict_knn).
You’ll apply a k-NN algorithm to the GDP/capita of the countries in world_bank_test to predict their percentage of urban population. Both world_bank_train and world_bank_test sets are still loaded in your workspace.
Inspect the my_knn function and try to identify the following steps:
dist, the absolute distance between the new observation and your training sample.
sort_index, the order of dist, using order(); this function returns the indices of elements in dist, in ascending order. So sort_index[1] will return the index of the smallest distance in dist, sort_index[2] the second smallest, etc.
predict_knn[i], the mean() of the values in y that correspond to the k smallest distances. Note that sort_index[1:k] will return the indices of the k smallest elements in dist. y contains the percentages of urban population in the training set.
###
# A pre-coded k-nn algorithm
my_knn <- function(x_pred, x, y, k){
m <- length(x_pred)
predict_knn <- rep(0, m)
for (i in 1:m) {
# Calculate the absolute distance between x_pred[i] and x
dist <- abs(x_pred[i] - x)
# Apply order() to dist, sort_index will contain
# the indices of elements in the dist vector, in
# ascending order. This means sort_index[1:k] will
# return the indices of the k-nearest neighbors.
sort_index <- order(dist)
# Apply mean() to the responses of the k-nearest neighbors
predict_knn[i] <- mean(y[sort_index[1:k]])
}
return(predict_knn)
}
###
Calculate test_output. If you're not sure which arguments to use, have a look at the description!
test_output <- my_knn(world_bank_test$cgdp, world_bank_train$cgdp,
world_bank_train$urb_pop, 30)
plot(world_bank_train,
xlab = "GDP per Capita",
ylab = "Percentage Urban Population")
points(world_bank_test$cgdp, test_output, col = "green")
Now let’s compare the k-NN results with the linear regression to see which one is the best model!
So now you've built three different models for the same data: a simple linear model, lm_wb; a log-linear model, lm_wb_log; and a non-parametric k-NN model. The k-NN model is actually simply a function that takes test and training data and predicts response variables on the fly: my_knn().
These objects are all stored in your workspace, as are world_bank_train and world_bank_test.
Have a look at the sample code given below, which shows the first steps for building a fancy plot. In the end, three lines should be plotted that represent the predicted responses for the test set, together with the true responses.
You’ll also calculate the RMSE of the test set for the simple linear, log-linear and k-NN regression. Have a look at the results, which regression approach performs the best?
# Define ranks to order the predictor variables in the test set
ranks <- order(world_bank_test$cgdp)
Plot the original test dataset. Predict the responses with the simple linear model, assign the outcome to test_output_lm and overplot with lines() the corresponding line.
Similar to the linear model, use predict() and assign this prediction to test_output_lm_log. Next, use lines() to add a “red” line representing the log-linear output. Use the same line width.
Predict the responses with k-NN and assign the result to test_output_knn, just like you did in the previous exercise. Again using lines(), add a “green” line of width 2 that represents the results of the k-NN algorithm.
# Make the plot squared.
par(pty="s")
# Scatter plot of test set
plot(world_bank_test,
xlab = "GDP per Capita", ylab = "Percentage Urban Population")
# Predict with simple linear model and add line
test_output_lm <- predict(lm_wb, data.frame(cgdp = world_bank_test$cgdp))
lines(world_bank_test$cgdp[ranks], test_output_lm[ranks], lwd = 2, col = "blue")
# Predict with log-linear model and add line
test_output_lm_log <- predict(lm_wb_log, data.frame(cgdp = world_bank_test$cgdp))
lines(world_bank_test$cgdp[ranks], test_output_lm_log[ranks], lwd = 2, col = "red")
# Predict with k-NN and add line
test_output_knn <- my_knn(world_bank_test$cgdp, world_bank_train$cgdp,
world_bank_train$urb_pop, 30)
lines(world_bank_test$cgdp[ranks], test_output_knn[ranks], lwd = 2, col = "green")
# Calculate RMSE on the test set for simple linear model
sqrt(mean( (test_output_lm - world_bank_test$urb_pop) ^ 2))
## [1] 17.41897
# Calculate RMSE on the test set for log-linear model
sqrt(mean( (test_output_lm_log - world_bank_test$urb_pop) ^ 2))
## [1] 15.01911
# Calculate RMSE on the test set for k-NN technique
sqrt(mean( (test_output_knn - world_bank_test$urb_pop) ^ 2))
## [1] 16.10026
The log-linear model seems to give the best RMSE on the test set! That was all for regression. Time to take your first serious steps in the world of unsupervised learning. But fear not! Gilles is there to guide you!
As an unsupervised learning technique, clustering requires a different approach than the ones you have seen in the previous chapters. How can you cluster? When is a clustering any good? All these questions will be answered; you’ll also learn about k-means clustering and hierarchical clustering along the way. At the end of this chapter and our machine learning video tutorials, you’ll have a basic understanding of all the main principles.
Cluster: a collection of similar objects
Similar within cluster
Dissimilar between clusters
Clustering: grouping objects in clusters
No labels: Unsupervised classification
Many possible clusterings
Pattern analysis
Visualize Data
Pre-processing Step
Outlier Detection
Targeted Marketing Programs
Student Segmentations
Data Mining
…
Measure of Similarity:
Numerical variables: metrics such as Euclidean, Manhattan, …
Categorical variables: construct your own distance.
Clustering Methods
k-means
Hierarchical (many variations)
…
Within Cluster Sum of Squares (WSS): WSS = Σ_{i=1..k} Σ_{x ∈ C_i} d(x, c_i)², where
x: object
C_i: cluster i, with centroid c_i
k: number of clusters
WSS is a measure of compactness: minimize WSS.
Between Clusters Sum of Squares (BSS): BSS = Σ_{i=1..k} |C_i| d(c_i, m)², where
m: sample mean of all objects; |C_i|: number of objects in cluster i
k: number of clusters
BSS is a measure of separation: maximize BSS.
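To make these definitions concrete, here is a small sketch (on hypothetical toy data) that computes WSS and BSS by hand and checks them against the components reported by kmeans():

```r
# Toy data (hypothetical): two loose groups of points
set.seed(1)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
km <- kmeans(toy, centers = 2, nstart = 10)

# WSS: squared distances of each object to its own cluster centroid
wss <- sum((toy - km$centers[km$cluster, ])^2)

# BSS: size-weighted squared distances of the centroids to the sample mean
m   <- colMeans(toy)
bss <- sum(km$size * rowSums(sweep(km$centers, 2, m)^2))

all.equal(wss, km$tot.withinss)   # TRUE
all.equal(bss, km$betweenss)      # TRUE
all.equal(wss + bss, km$totss)    # WSS + BSS = total sum of squares
```

Note that WSS and BSS always add up to the total sum of squares, which is why minimizing one is equivalent to maximizing the other.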
The goal is to partition the data into k disjoint subsets. The algorithm follows four steps:
Randomly assign centroids.
Assign data to the closest centroid.
Move each centroid to the average position.
Repeat steps 2 and 3 until the cluster assignments no longer change.
At that point the algorithm has converged.
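As a sketch of these four steps, here is a bare-bones, hypothetical implementation in R (in practice you would of course use kmeans(); this version, for instance, does not handle empty clusters beyond a simple guard):

```r
# Minimal k-means sketch following the four steps above
my_kmeans <- function(x, k, iters = 100) {
  # Step 1: randomly pick k observations as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assign <- integer(nrow(x))
  for (i in seq_len(iters)) {
    # Step 2: assign every object to its closest centroid
    d2 <- sapply(seq_len(k), function(j)
      rowSums((x - matrix(centroids[j, ], nrow(x), ncol(x), byrow = TRUE))^2))
    new_assign <- max.col(-d2, ties.method = "first")
    # Step 4: stop when the assignments no longer change (converged)
    if (all(new_assign == assign)) break
    assign <- new_assign
    # Step 3: move each centroid to the average position of its objects
    for (j in seq_len(k)) {
      members <- which(assign == j)
      if (length(members) > 0)   # guard: leave an empty cluster in place
        centroids[j, ] <- colMeans(x[members, , drop = FALSE])
    }
  }
  list(cluster = assign, centers = centroids)
}

# Two well-separated blobs (hypothetical data): each ends up in its own cluster
set.seed(2)
pts <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))
res <- my_kmeans(pts, k = 2)
table(res$cluster)
```

Because the starting centroids are random, different runs can converge to different solutions, which is exactly why kmeans() offers the nstart argument discussed below.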
The goal is to find the right k that minimizes WSS.
The problem is that WSS keeps decreasing as k increases!
The solution to this problem is to fix k at the value where WSS starts decreasing only slowly, i.e., where a larger k buys little additional compactness.
A scree plot helps in defining the best k. It visualizes the relation between k and the ratio of WSS to the total sum of squares (TSS). For k = 1 this ratio equals 1, and it decreases as k increases, down to a point where the curve becomes somewhat flat, i.e., only a small decrease for subsequently larger values of k. This point is the ``elbow'' of the plot. It determines the best k to be used for the k-means method.
In R, k-means clustering is done with kmeans(), which determines distances in the Euclidean metric. kmeans() takes three arguments:
data: the dataset.
centers: the starting centroids, or the number of clusters.
nstart: the number of times R restarts with different centroids.
my_km <- kmeans(data, centers, nstart)
The kmeans() object exposes both sums of squares:
my_km$tot.withinss ### WSS
my_km$betweenss ### BSS
In the dataset “seeds” (Source: UCIMLR) there are various metrics for 210 seeds. Remember this dataset from Chapter 2 which you clustered using the k-means method? Well, we’ve found the labels of the seeds. They are stored in the vector seeds_type; there were indeed three types of seeds!
While clusters are made without the use of true labels, if you happen to have them, it is interesting to see how well the clusters you made correspond to these true labels.
It’s up to you now to cluster the instances in seeds and compare the resulting clusters with seeds_type.
# Set random seed for reproducibility.
set.seed(100)
Group the seeds in three clusters using kmeans(). Set nstart to 20 to let R randomly select the centroids 20 times. Assign the result to seeds_km.
# Do k-means clustering with three clusters, repeat 20 times.
seeds_km <- kmeans(seeds, centers = 3, nstart = 20)
seeds_km
## K-means clustering with 3 clusters of sizes 77, 72, 61
##
## Cluster means:
## area perimeter compactness length width asymmetry groove_length
## 1 11.96442 13.27481 0.8522000 5.229286 2.872922 4.759740 5.088519
## 2 14.64847 14.46042 0.8791667 5.563778 3.277903 2.648933 5.192319
## 3 18.72180 16.29738 0.8850869 6.208934 3.722672 3.603590 6.066098
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 2 3 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 2 2 2 2 2 2 1 1 1 1 2 2 2 2 2 1 3 3
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3
## 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 3
## 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 3 3 3 3 3 3 2 2 2 2 3 2 2 2 1 1 1 1
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 199 200 201 202 203 204 205 206 207 208 209 210
## 1 1 1 2 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 195.7453 207.4648 184.1086
## (between_SS / total_SS = 78.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
Compare the clusters in seeds_km with seeds_type using the table() function.
table(seeds_km$cluster, seeds_type)
## seeds_type
## 1 2 3
## 1 9 0 68
## 2 60 10 2
## 3 1 60 0
Plot the width against the length of the seeds with plot(). Set the col argument to seeds_km$cluster. Use the comma format by specifying the x and y arguments explicitly, not the formula format using ~.
plot(seeds$width, seeds$length, col = seeds_km$cluster)
If you have a look at the table that was generated, you clearly see three groups with 60 elements or more. Overall, the clusters you formed adequately represent the different types of seeds. These larger groups show the correspondence between the clusters and the actual types, e.g. cluster 1 corresponds to seeds_type 3.
If you call kmeans() without specifying your centroids, R will randomly assign them for you. In this exercise, you will see the influence of these starting centroids yourself using the seeds dataset.
To compare the clusters of two cluster models, you can again use table(). If every row and every column has one value, the resulting clusters completely overlap. If this is not the case, some objects are placed in different clusters.
# Set random seed for reproducibility.
set.seed(100)
Apply kmeans() on seeds with 5 clusters. Set nstart to 1. Assign the result to seeds_km_1, then repeat this and assign the result to seeds_km_2.
seeds_km_1 <- kmeans(seeds, 5, nstart = 1)
seeds_km_2 <- kmeans(seeds, 5, nstart = 1)
Compare the ratio of the tot.withinss elements of seeds_km_1 and seeds_km_2. Put the first one in the numerator.
seeds_km_1$tot.withinss / seeds_km_2$tot.withinss
## [1] 1.006146
table(seeds_km_1$cluster, seeds_km_2$cluster)
##
## 1 2 3 4 5
## 1 0 16 0 0 18
## 2 0 56 0 0 0
## 3 0 0 12 0 0
## 4 0 0 0 18 41
## 5 33 0 3 13 0
As you can see, some clusters remained the same while others have changed. For example, cluster 5 from seeds_km_1 completely contains cluster 1 from seeds_km_2 (33 objects). Cluster 4 from seeds_km_1 is split: 18 objects were put in seeds_km_2's fourth cluster and 41 in its fifth cluster. For consistent and decent results, you should set nstart to a sufficiently high value or provide a prior estimate of your centroids.
Let’s move on to some new data: school results! You’re given a dataset school_result containing school level data recording reading and arithmetic scores for each school’s 4th and 6th graders. (Source: cluster.datasets package). We’re wondering if it’s possible to define distinct groups of students based on their scores and if so how many groups should we consider?
Here is the dataset
school_result <- read.table(file = "./Data/school_result.dat", header = T)
Your job is to cluster the schools based on their scores with k-means, for different values of k. On each run, you'll record the ratio of the within cluster sum of squares to the total sum of squares. The scree plot will tell you which k is optimal!
# Set random seed.
set.seed(100)
Explore the structure of the dataset with str().
str(school_result)
## 'data.frame': 25 obs. of 4 variables:
## $ reading.4 : num 2.7 3.9 4.8 3.1 3.4 3.1 4.6 3.1 3.8 5.2 ...
## $ arithmetic.4: num 3.2 3.8 4.1 3.5 3.7 3.4 4.4 3.3 3.7 4.9 ...
## $ reading.6 : num 4.5 5.9 6.8 4.3 5.1 4.1 6.6 4 4.7 8.2 ...
## $ arithmetic.6: num 4.8 6.2 5.5 4.6 5.6 4.7 6.1 4.9 4.9 6.9 ...
Initialize a vector of length 7, ratio_ss, that contains all zeros. You can use rep().
ratio_ss <- rep(0, 7)
Write a for-loop that runs over k from 1 to 7:
Apply k-means to school_result with k clusters and assign the result to school_km. Save the corresponding ratio of tot.withinss to totss in the vector ratio_ss at index k. These values are found in the school_km object.
for(k in 1:7){
# Apply k-means to school_result: school_km
school_km <- kmeans(school_result, k, nstart = 20)
# Save the ratio between of WSS to TSS in kth element of ratio_ss
ratio_ss[k] <- school_km$tot.withinss / school_km$totss
}
Make a scree plot of ratio_ss. Set the type argument in plot() to "b", connecting the points. Also set the xlab argument to "k".
plot(ratio_ss, type = "b", xlab = "k")
What does the scree plot show? Interpret it in the next exercise!
In the last exercise you made a scree plot, showing the ratio of the within cluster sum of squares to the total sum of squares.
You want to choose k such that your clusters are compact and well separated. However, ratio_ss keeps decreasing as k increases. Hence, if you were to minimize this ratio as a function of k, you'd end up with a cluster for every school, which is a meaningless result. You should choose k such that increasing it further has no significant impact on ratio_ss. The elbow in the scree plot will help you identify this turning point.
Can you tell which of the following values of k will provide a meaningful clustering with overall compact and well separated clusters?
While the elbow is not always unambiguously identified, it gives you an idea of the amount of clusters you should try out.
Not trivial. There is no absolute truth!
No true labels
No true response
The evaluation method depends on the goal of your ML procedure. Here, the goal is to determine how compact and well separated your clusters are, and that is exactly what the evaluation method you apply should measure.
WSS and BSS are good indications about your clustering quality.
The underlying idea is to compare the variance within the clusters and the separation between the clusters.
An alternative idea is to measure the diameters of the clusters and the intercluster distances.
Cluster Diameter: diam(C) = max_{x, y ∈ C} d(x, y), where x and y are objects in the cluster C and d(x, y) is the distance between them. The diameter is a measure of compactness.
Intercluster Distance: dist(C_i, C_j) = min_{x ∈ C_i, y ∈ C_j} d(x, y), where x and y are objects in the different clusters C_i and C_j. This distance is a measure of separation.
Dunn's Index summarizes the two above-mentioned measures: D = min_{i ≠ j} dist(C_i, C_j) / max_l diam(C_l)
A high Dunn's index indicates better separated and more compact clusters!
It is, however, expensive to calculate and is a worst-case indicator.
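A from-scratch sketch of Dunn's index as defined above may make this concrete (hypothetical helper on toy data; for real use, the clValid package provides dunn(), used later in this chapter):

```r
# Dunn's index: minimal intercluster distance / maximal cluster diameter
dunn_index <- function(x, cluster) {
  d  <- as.matrix(dist(x))          # all pairwise Euclidean distances
  ks <- unique(cluster)
  # Maximal cluster diameter (compactness, to be minimized)
  diam <- max(sapply(ks, function(k) {
    idx <- which(cluster == k)
    if (length(idx) < 2) 0 else max(d[idx, idx])
  }))
  # Minimal distance between objects of different clusters (separation)
  sep <- min(d[outer(cluster, cluster, "!=")])
  sep / diam
}

# Two tight, well-separated toy clusters give a high index
xs <- matrix(c(0, 0.1, 10, 10.1), ncol = 1)
cl <- c(1, 1, 2, 2)
dunn_index(xs, cl)   # 9.9 / 0.1 = 99
```

Note how a single outlier placed in its own cluster can shrink the diameters and inflate the index, which is why Dunn's index is a worst-case indicator.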
Internal Validation: Based on intrinsic knowledge
BIC Index
Silhouette’s Index
External Validation: Based on previous knowledge
Hubert's Correlation
Jaccard’s Coefficient
Libraries: cluster and clValid
Dunn's Index: dunn()
clusters: cluster partitioning vector
Data: original data
dunn(clusters = my_km$cluster, Data = ...)
Metrics are often scale dependent!
Example: which pair of people is more similar, given their (Age, Income, IQ)?
Intuitively, one pair may be closer in age and IQ; however, because income is expressed in dollars, its large values dominate the Euclidean distance, so the other pair comes out as "more similar"!
The solution to this dilemma is to rescale the income variable, e.g. by expressing it in thousands of dollars (dividing by 1000)!
Problem: Multiple Variables on different scales
Solution: Standardize your data
Subtract the mean
Divide by the Standard Deviation
scale(data)
Note: standardizing leads to a different interpretation of your variables.
It’s always important to assess the scale of your variables prior to clustering. If your variables are on different scales, it’s best to either re-scale some of them or ultimately standardize them all.
There are four datasets imported in your workspace:
school_result, containing scores (from 0 to 10) of schools.
moon_data, containing each moon's distance to its planet (in miles) and its orbital period (in earth days).
run_record, which contains the Olympic run records for 50 countries, for the 100m, 200m, … to the marathon. The records are all denoted in seconds.
cereal_data, listing the amount of calories, sodium and potassium several cereals contain.
# Import school dataset
school_result <- read.table(file = "./Data/school_result.dat", header = T)
# Import moon dataset (in original form)
moon_data_in <- read.table("./Data/moon_data.dat", header=TRUE, sep = "")
# Create a subset of the original dataset, and get the row names from the original set.
moon_data <- subset(moon_data_in, select = c(distance, period))
rownames(moon_data) <- moon_data_in$X
# Import run_records dataset (in original form)
run_record_in <- read.table("./Data/run_record.dat", header=TRUE, sep = "")
# Create a subset of the original dataset, and get the row names from the original set.
run_record <- subset(run_record_in, select = c(X100m, X200m, X400m, X800m, X1500m, X5000m,
X10000m, marathon))
rownames(run_record) <- run_record_in$X
# Import cereal dataset
cereal_data_in <- read.table("./Data/cereal_data.dat", header=TRUE, sep = "")
cereal_data <- subset(cereal_data_in, select = c(Calories, Potassium, Sodium))
rownames(cereal_data) <- cereal_data_in$X
Which of these datasets should best be fully standardized?
Before you answer this question, you can take a look at the summary of each dataset
summary(school_result)
## reading.4 arithmetic.4 reading.6 arithmetic.6
## Min. :2.700 Min. :3.000 Min. :4.000 Min. :4.6
## 1st Qu.:3.100 1st Qu.:3.300 1st Qu.:4.500 1st Qu.:5.0
## Median :3.400 Median :3.600 Median :5.100 Median :5.3
## Mean :3.656 Mean :3.708 Mean :5.276 Mean :5.4
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.900 3rd Qu.:5.6
## Max. :5.700 Max. :5.100 Max. :8.200 Max. :6.9
summary(moon_data)
## distance period
## Min. : 5800 Min. : 7.7
## 1st Qu.: 174500 1st Qu.: 52.5
## Median : 365000 Median : 209.0
## Mean : 3187658 Mean : 3678.9
## 3rd Qu.: 5297000 3rd Qu.: 6123.0
## Max. :14700000 Max. :18792.0
summary(run_record)
## X100m X200m X400m X800m
## Min. : 9.78 Min. :19.32 Min. :43.18 Min. :101.4
## 1st Qu.:10.10 1st Qu.:20.17 1st Qu.:44.91 1st Qu.:103.8
## Median :10.20 Median :20.43 Median :45.58 Median :105.6
## Mean :10.22 Mean :20.54 Mean :45.83 Mean :106.1
## 3rd Qu.:10.32 3rd Qu.:20.84 3rd Qu.:46.32 3rd Qu.:108.0
## Max. :10.97 Max. :22.46 Max. :51.40 Max. :116.4
## X1500m X5000m X10000m marathon
## Min. :206.4 Min. : 759.6 Min. :1588 Min. : 7473
## 1st Qu.:213.0 1st Qu.: 788.9 1st Qu.:1653 1st Qu.: 7701
## Median :216.6 Median : 805.2 Median :1675 Median : 7819
## Mean :219.2 Mean : 817.1 Mean :1712 Mean : 8009
## 3rd Qu.:224.2 3rd Qu.: 834.5 3rd Qu.:1739 3rd Qu.: 8050
## Max. :254.4 Max. :1002.0 Max. :2123 Max. :10276
summary(cereal_data)
## Calories Potassium Sodium
## Min. : 50.0 Min. : 15.00 Min. : 0.0
## 1st Qu.:100.0 1st Qu.: 37.50 1st Qu.:145.0
## Median :110.0 Median : 60.00 Median :190.0
## Mean :107.9 Mean : 84.42 Mean :180.5
## 3rd Qu.:110.0 3rd Qu.:110.00 3rd Qu.:220.0
## Max. :160.0 Max. :320.00 Max. :320.0
In the school_result all variables are on a scale from 1 to 10. There’s no difference in scale, so no need for standardizing.
In the moon_data there is a difference in scale, but you could easily rescale this by expressing the distance in thousands of miles. Avoid losing the units of your variables; that often makes interpretation harder.
The run_record dataset should be fully standardized. The scales of the run records differ substantially. Moreover, they all share the same unit, so standardizing will not make the interpretation of the clusters any harder!
In the cereal_data all variables are on approximately the same scale. No standardization needed.
You agreed that the run_record data should be standardized prior to clustering, but is it truly that important?
In this exercise you’ll cluster the countries based on their unstandardized records and calculate Dunn’s index. Take your time to check out the dataset, you’ll be using it in the upcoming exercises.
In the instructions that follow, you’ll also create a scatter plot of two variables of the run_record dataset. Note that the clustering is performed based on all the data, not only these two columns!
# Set random seed.
set.seed(1)
Explore the dataset with str() and summary().
str(run_record)
## 'data.frame': 54 obs. of 8 variables:
## $ X100m : num 10.23 9.93 10.15 10.14 10.27 ...
## $ X200m : num 20.4 20.1 20.4 20.2 20.3 ...
## $ X400m : num 46.2 44.4 45.8 45 45.3 ...
## $ X800m : num 106 104 106 104 107 ...
## $ X1500m : num 221 212 215 214 222 ...
## $ X5000m : num 800 776 796 770 878 ...
## $ X10000m : num 1659 1652 1663 1612 1829 ...
## $ marathon: num 7774 7651 7933 7632 8782 ...
Cluster run_record in 5 groups, set nstart to 20. Assign your result to run_km.
run_km <- kmeans(run_record, 5, nstart = 20)
Plot the marathon variable on the x-axis and X100m on the y-axis. Color your points using your cluster. Label your axes! And use the comma format by specifying the x and y arguments to plot() explicitly, not with the formula format using ~.
## NOT with the format using "~" as requested.
plot(run_record$marathon, run_record$X100m, xlab = "Marathon Time (sec)",
ylab = "100 m Time (sec)", col=run_km$cluster)
# WITH the format using "~".
# plot(run_record$X100m ~ run_record$marathon, xlab = "Marathon Time (sec)",
# ylab = "100 m Time (sec)", col=run_km$cluster)
Calculate Dunn's index with the dunn() function. Store the result in dunn_km and print it out. This function comes with the clValid package. Load it if installed, or install it first if not!
# Install the package if not already
# install.packages("clValid")
library(clValid)
dunn_km <- dunn(distance = NULL, clusters = run_km$cluster, Data = run_record,
method = "euclidean")
dunn_km
## [1] 0.05651773
As you can see in the plot, the unstandardized clusters are completely dominated by the marathon records; you could even separate every cluster based on the marathon records alone! Moreover, Dunn's index seems to be quite low.
Compare your results with the standardized version in the next exercise
You expected it already, the unstandardized clusters don’t produce satisfying results. Let’s see if standardizing helps!
Your job is to standardize the run_record dataset and apply the k-means algorithm again. Calculate Dunn's index and compare the results!
# Set random seed.
set.seed(1)
Standardize run_record using scale(). Apply as.data.frame() on the result and assign it to run_record_sc.
run_record_sc <- as.data.frame(scale(run_record))
Cluster run_record_sc in 5 groups; set nstart to 20. Assign your result to run_km_sc.
run_km_sc <- kmeans(run_record_sc, 5, nstart = 20)
Plot the marathon variable on the x-axis and X100m on the y-axis. Use the original dataset, run_record! Color your points using your cluster. Label your axes!
plot(run_record$marathon, run_record$X100m, xlab = "Marathon Time (sec)",
ylab = "100 m Time (sec)", col=run_km_sc$cluster)
Compare the standardized clusters with run_km, using table(). Set the unstandardized clusters as rows.
table(run_km$cluster, run_km_sc$cluster)
##
## 1 2 3 4 5
## 1 0 6 0 0 0
## 2 7 0 10 0 1
## 3 1 0 10 0 13
## 4 2 2 0 0 0
## 5 0 0 0 2 0
Calculate Dunn's index with the dunn() function. Store the result in dunn_km_sc and print it out.
dunn_km_sc <- dunn(distance = NULL, clusters = run_km_sc$cluster, Data = run_record_sc,
method = "euclidean")
dunn_km_sc
## [1] 0.1453556
The plot now shows the influence of the 100m records on the resulting clusters! Dunn's index is clear about it: the standardized clusters are more compact and/or better separated!
The k-means algorithm requires you to set the number of clusters in advance. We may, however, want to know more. For example, what is the hierarchical structure of our data:
Which objects cluster first?
Which pairs of clusters merge, and when?
Starts from the objects
Builds up a hierarchy of clusters.
The hierarchy depends on the distance you choose to compare your objects.
Bottom-Up Algorithm
Calculate the distances between all pairs of objects and store the values in a distance matrix.
Put every object in its own cluster.
Find the closest pair of clusters (initially, single objects) and merge them. The number of clusters is reduced by one.
Compute the distances between the new cluster and the old ones and save them in a new distance matrix.
Repeat the previous two steps until all objects are clustered into a single cluster.
Linkage Methods (how we define the distance between clusters)
We calculate the distance between the clusters based on the distances between their objects.
Single Linkage: the minimal distance from any member of one cluster to any member of the other cluster.
Complete Linkage: The maximal distance between the objects of the clusters is the distance between the clusters.
Average Linkage: the average distance between all pairs of objects, one from each cluster.
Different linkage methods provide different clusterings and thus different insights into your data.
Single-linkage can lead to undesirable chaining. However, this chaining effect can be used to identify outliers: data points that are far off will merge last.
The dendrogram tells us which clusters merge at which point. It is a tree representation of our hierarchy.
At the bottom are the leaves, which are the objects.
At some point they will merge together with other objects or clusters.
The merging is indicated by the horizontal bar.
The height tells you the distance between the objects before the merge.
You can cut the tree at any height; it will return the clusters whose mutual distance is greater than the given height.
Library: stats
Start by building the distance matrix
dist(x, method)
x: dataset, method: distance to be used ("euclidean", "manhattan", etc.)
Then apply hclust() to your distance matrix and set the argument method to the linkage method of your choice.
hclust(d, method)
d: distance matrix, method: Linkage
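A minimal end-to-end sketch of this pipeline, on hypothetical toy values with two obvious groups: build the distance matrix, cluster hierarchically, and cut the tree.

```r
# Toy data: two clearly separated groups on a line
scores <- data.frame(value = c(1, 1.1, 1.2, 8, 8.1, 8.2))

d_toy    <- dist(scores, method = "euclidean")  # distance matrix
hc_toy   <- hclust(d_toy, method = "average")   # hierarchy, average linkage
memb_toy <- cutree(hc_toy, k = 2)               # cut the dendrogram into 2 clusters
memb_toy  # the first three points form one cluster, the last three the other

# plot(hc_toy) would draw the corresponding dendrogram
```

Swapping "average" for "single" or "complete" in hclust() is all it takes to compare linkage methods on the same distance matrix.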
Pros
In-depth analysis of how your clusters are formed
Unveil different patterns by applying different Linkage-methods
Cons
Computationally intensive.
It cannot undo any merges that have been made before.
When faced with a large number of objects, the tree becomes very difficult to read.
Pros
Allows clusters to change (objects can switch clusters), sometimes even drastically, between steps.
Fast to compute. Less demanding than hierarchical clustering.
Cons
Can be harder to interpret.
It requires you to set the number of clusters in advance.
It depends on the starting centroids.
We can compare the separation and compactness of both clustering techniques by calculating the sum of squares, Dunn’s index, or any other measure.
Let’s test our understanding. Based on the introduction to hierarchical clustering above, which of the following statements is false?
Single linkage hierarchical clustering can lead to undesirable chaining of your objects. True
K-means is overall computationally less intensive than bottom-up hierarchical clustering. True
The dendrogram is a tree that represents the hierarchical clustering. It tells you which clusters merged when. True
In complete linkage, the distance between two clusters is taken to be the minimal distance from any member of one cluster to any member of the other cluster. False
Let’s return to the Olympic records example. You’ve already clustered the countries using the k-means algorithm, but this gave you a fixed amount of clusters. We’re interested in more!
In this exercise, you’ll apply the hierarchical method to cluster the countries. Of course, you’ll be working with the standardized data. Make sure to visualize your results!
# Import the run_records dataset, name the rows appropriately.
run_record_in <- read.table("./Data/run_record.dat", header=TRUE, sep = "")
run_record <- subset(run_record_in, select = c(X100m, X200m, X400m, X800m, X1500m, X5000m,
X10000m, marathon))
rownames(run_record) <- run_record_in$X
# Standardize the dataset.
run_record_sc <- as.data.frame(scale(run_record))
Calculate the distance matrix of run_record_sc using dist(). Assign it to run_dist. dist() uses the Euclidean method by default.
run_dist <- dist(run_record_sc, method = "euclidean")
Use the run_dist matrix to cluster your data hierarchically, based on single-linkage. Use hclust() with two arguments. Assign it to run_single.
run_single <- hclust(run_dist, method = "single")
Cut the tree with cutree() at 5 clusters. Assign the result to memb_single.
memb_single <- cutree(run_single, k = 5)
Make a dendrogram of run_single using plot(). If you pass a hierarchical clustering object to plot(), it will draw the dendrogram of this clustering.
Draw boxes around the 5 clusters using rect.hclust(). Set the border argument to 2:6, for different colors.
plot(run_single)
rect.hclust(run_single, k = 5, border = c(2:6))
With this linkage it appears that the two island nations Samoa and the Cook Islands, which are not known for their sports performances, have each been placed in their own group. Maybe we're dealing with some chaining issues? Let's try a different linkage method in the next exercise!
The clusters of the last exercise weren’t truly satisfying. The single-linkage method appears to be placing each outlier in its own cluster. Let’s see if complete-linkage agrees with this clustering!
In this exercise, you’ll repeat some steps from the last exercise, but this time for the complete-linkage method. Visualize your results and compare with the single-linkage results. The model solution to the previous exercise is already available:
# Code for single-linkage hierarchical clustering
run_dist <- dist(run_record_sc, method = "euclidean")
run_single <- hclust(run_dist, method = "single")
memb_single <- cutree(run_single, 5)
# Visualization code lines (not necessary to repeat):
# plot(run_single)
# rect.hclust(run_single, k = 5, border = 2:6)
Use the run_dist matrix to cluster your data hierarchically, based on complete-linkage. Use hclust(). Assign it to run_complete.
run_complete <- hclust(run_dist, method = "complete")
Cut the tree with cutree() at 5 clusters. Assign the result to memb_complete.
memb_complete <- cutree(run_complete, 5)
Make a dendrogram of run_complete using plot(). Superimpose boxes with rect.hclust(); set the border argument to 2:6 again.
plot(run_complete)
rect.hclust(run_complete, k = 5, border = 2:6)
Compare the two clusterings with table().
table(memb_single, memb_complete)
## memb_complete
## memb_single 1 2 3 4 5
## 1 27 7 14 0 1
## 2 0 0 0 1 0
## 3 0 0 0 0 1
## 4 0 0 0 0 2
## 5 0 0 0 1 0
Compare the two plots. The five clusters differ significantly from the single-linkage clusters. That one big cluster you had before is now split up into 4 medium-sized clusters. Have a look at the table you generated as well!
So you’ve clustered the countries based on their Olympic run performances using three different methods: k-means clustering, hierarchical clustering with single linkage and hierarchical clustering with complete linkage. You can ask yourself: which method returns the best separated and the most compact clusters?
Let’s use Dunn’s index. Remember, it returns the ratio between the minimum intercluster distance and the maximum intracluster diameter. The dunn() function in R requires the argument clusters, indicating the cluster partitioning, the Data and a method to determine the distance. In this case, that’s "euclidean", which is the default.
Your job is to calculate Dunn’s index for all three clusterings and compare the clusters to each other. The R objects you calculated in the previous exercises are already available in your workspace.
# run_record_sc, run_km_sc, memb_single and memb_complete are calculated
# in the previous exercises.
# Set random seed.
set.seed(100)
Calculate Dunn's index for the k-means clustering run_km_sc. Assign the result to dunn_km. Use the function dunn().
dunn_km <- dunn(clusters = run_km_sc$cluster, Data = run_record_sc, method = "euclidean")
Do the same for the single-linkage clustering memb_single, and assign the index to dunn_single.
dunn_single <- dunn(clusters = memb_single, Data = run_record_sc, method = "euclidean")
Repeat for memb_complete, this time resulting in a Dunn's index dunn_complete.
dunn_complete <- dunn(clusters = memb_complete, Data = run_record_sc, method = "euclidean")
Compare the k-means clustering with the single-linkage clustering using table(). Place the k-means cluster assignments in the table rows.
table(run_km_sc$cluster, memb_single)
## memb_single
## 1 2 3 4 5
## 1 9 0 1 0 0
## 2 6 0 0 2 0
## 3 20 0 0 0 0
## 4 0 1 0 0 1
## 5 14 0 0 0 0
Compare the k-means clustering with the complete-linkage clustering using table(). Place the k-means cluster assignments in the table rows.
table(run_km_sc$cluster, memb_complete)
## memb_complete
## 1 2 3 4 5
## 1 0 0 8 0 2
## 2 0 0 6 0 2
## 3 20 0 0 0 0
## 4 0 0 0 2 0
## 5 7 7 0 0 0
The table shows that the clusters obtained from the complete linkage method are similar to those of k-means. Now it’s up to you to compare the values of Dunn’s index and decide which method returns the best separated and most compact clusters.
Do you remember what Dunn’s index measured? Let’s put that to the test!
Which of the following statements about your clusterings of the countries in run_record_sc is correct? Use the results of Dunn’s index, found in dunn_km_sc, dunn_single and dunn_complete in your workspace.
dunn_km_sc
## [1] 0.1453556
dunn_single
## [1] 0.2921946
dunn_complete
## [1] 0.1808437
Incorrect. Dunn’s index returns the ratio of minimal intercluster-distance to maximal cluster diameter.
Incorrect. Dunn’s index returns the ratio of minimal intercluster distance to maximal cluster diameter. Hence, a lower Dunn’s index indicates that the minimal intercluster distance decreased and/or the maximal cluster diameter increased. Therefore, a lower Dunn’s index tells you at least one pair of clusters became less separated and/or the widest cluster became less compact.
Correct. Indeed, the single-linkage method that caused chaining effects actually returned the most compact and separated clusters … If you think about it, it makes sense. The single-linkage method puts every outlier in its own cluster, increasing the intercluster distances and reducing the diameters, hence giving a higher Dunn’s index. Therefore, you could conclude that the single-linkage method did a fine job identifying the outliers. However, if you’d like to report your clusters to the local newspapers, then complete linkage or k-means are probably the better choice!
Incorrect. Since Dunn’s index returns the ratio of minimal intercluster distance to maximal cluster diameter, a higher Dunn’s index indicates that the minimal intercluster distance increased and/or the maximal cluster diameter decreased. Therefore, a higher Dunn’s index tells you at least one pair of clusters became more separated and/or the widest cluster became more compact.
You’ve seen that different clustering methods can return entirely different clusters, each with their own interpretation and uses. It’s time to put your skills, in both programming and interpretation, to the test!
Your client has provided you with a dataset, crime_data, containing info on the crimes committed in each of the 50 US states and the percentage of urban population (Source: Edureka). He’d like you to group the states in 4 clusters. He didn’t specify which similarity to use, but the euclidean distance seems acceptable, don’t you agree?
You decide to try out two techniques: k-means and single-linkage hierarchical clustering. You then want to compare the results by calculating the Dunn’s indices to make a conclusion. Which clustering will you deliver to your client?
# Set random seed.
set.seed(1)
Standardize the data with scale(). Call the scaled dataset crime_data_sc.
# Import the dataset, name the rows appropriately.
crime_data_in <- read.table("./Data/crime_data.dat", header=TRUE, sep = "")
crime_data <- subset(crime_data_in, select = c(murder, assault, urb_pop, rape))
rownames(crime_data) <- crime_data_in$X
# Have a look at them
head(crime_data)
## murder assault urb_pop rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
# Standardize the dataset.
crime_data_sc <- as.data.frame(scale(crime_data))
# Have a look at the scaled dataset
head(crime_data_sc)
## murder assault urb_pop rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
Apply kmeans() to this scaled dataset. You want 4 clusters. Assign the result to crime_km. Set the nstart argument to 20 to achieve a robust result.
crime_km <- kmeans(crime_data_sc, centers = 4, nstart = 20)
Apply single-linkage hierarchical clustering by following these steps:
Calculate the distance matrix using dist(). Call it dist_matrix.
Call hclust() to perform the single-linkage hierarchical clustering, use dist_matrix here. Call the resulting object crime_single.
Cut the tree in 4 clusters with cutree(). Call the result memb_single.
dist_matrix <- dist(crime_data_sc)
crime_single <- hclust(dist_matrix, method = "single", members = NULL)
memb_single <- cutree(crime_single, k=4)
Use dunn() to calculate Dunn's index. You should use crime_km$cluster as a first argument to calculate this for the k-means clustering, call it dunn_km. Use memb_single as a first argument to calculate this for the hierarchical clustering, call it dunn_single.
Print out the results in dunn_km and dunn_single.
dunn_km <- dunn(clusters = crime_km$cluster, Data = crime_data_sc)
dunn_km
## [1] 0.1604403
dunn_single <- dunn(clusters = memb_single, Data = crime_data_sc)
dunn_single
## [1] 0.2438734
You successfully completed the last exercise of the course. Apparently, single-linkage hierarchical clustering provides more compact and better separated clusters than k-means here. The clustering we would offer to our client is the hierarchical one. Its dendrogram is shown below:
plot(crime_single, xlab = "Clusters Distance", ylab = "Clustering Height",
main="US criminal activity Dendrogram")
rect.hclust(crime_single, k = 4, border = 3:6)
Here is the structure of the constructed hierarchical tree:
str(crime_single)
## List of 7
## $ merge : int [1:49, 1:2] -15 -13 -14 -23 -19 -36 -27 -20 -38 -46 ...
## $ height : num [1:49] 0.206 0.35 0.429 0.494 0.5 ...
## $ order : int [1:50] 2 9 5 28 33 24 40 10 42 1 ...
## $ labels : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ method : chr "single"
## $ call : language hclust(d = dist_matrix, method = "single", members = NULL)
## $ dist.method: chr "euclidean"
## - attr(*, "class")= chr "hclust"
This exercise concludes the introductory course on Machine Learning with R.